[This Transcript is Unedited]
DEPARTMENT OF HEALTH AND HUMAN SERVICES
NATIONAL COMMITTEE ON VITAL AND HEALTH STATISTICS
SUBCOMMITTEE ON POPULATIONS
WORKSHOP ON DATA LINKAGES TO IMPROVE HEALTH OUTCOMES
September 19, 2006
999 9th Street, NW
CASET Associates, Ltd.
10201 Lee Highway
Fairfax, Virginia 22030
TABLE OF CONTENTS
- Call to Order
- IRS, Education and Veterans Administration
- Maximizing the Benefits from Linked Data: Access for Research and Related Issues
- A Broader Perspective on the Role of Linkages
P R O C E E D I N G S (9:02 a.m.)
DR. STEINWACHS: I’d like to welcome everyone to the second day of our
Workshop on Data Linkages to Improve Health Outcomes. This workshop has been
put together by the Subcommittee on Population Health of the National Committee
on Vital and Health Statistics, and this workshop is being broadcast live on
the Internet. So I would like to welcome those people who are listening on the
Internet.
And I thought before we got started officially, we might just go around the
room and introduce ourselves, for those who are on the Internet.
And I am Don Steinwachs. I chair the Subcommittee on Population Health, and
I am from Johns Hopkins University.
MR. IAMS: I am Howard Iams from the Social Security Administration Research
MS. MADANS: Jennifer Madans from the National Center for Health Statistics.
MR. HARRIS-KOJETIN: Brian Harris-Kojetin from the Office of Management and
Budget.
MR. BJORKLUND: Rick Bjorklund, Office of the Assistant Deputy Under
Secretary for Health, the Veterans Health Administration.
MR. PETSKA: I am Tom Petska. I am Director of the Statistics of Income
Division of the Internal Revenue Service.
MR. CHAPMAN: Chris Chapman, U.S. Department of Education, National Center
for Education Statistics.
MS. OBENSKI: Sally Obenski, Data Integration Division, U.S. Census Bureau.
MR. PREVOST: Ron Prevost, Data Integration Division, U.S. Census Bureau.
DR. DAVERN: Michael Davern, University of Minnesota.
DR. STEUERLE: Gene Steuerle from the Urban Institute and a member of the
committee.
MR. LOCALIO: Russell Localio, University of Pennsylvania, School of
Medicine, a member of the committee.
DR. SCANLON: Bill Scanlon, Health Policy R&D and a member of the
committee.
(Additional intros around the room.)
DR. STEINWACHS: It is a pleasure to welcome everyone today.
Our first session has speakers from Internal Revenue Service, Department of
Education, Department of Veterans Affairs, and Russell Localio, who is a member
of the committee, is going to serve as facilitator.
MR. LOCALIO: Good morning, everyone.
I just want to introduce our first speaker, Tom Petska, our friend from the
Internal Revenue Service.
PARTICIPANT: Is that true? Oh, of course that is true. Of course.
MR. PETSKA: I have no comment on that.
Actually, I am very pleased to be here to speak on this topic, Workshop on
Data Linkages to Improve Health Outcomes. A lot of people might think what is
the IRS role in that? I hope I can say a few things in the next 10 or 15
minutes about that.
In a way I feel a little bit inadequate about speaking to this group for a
couple of reasons.
One is that my division is centrally located between IRS and Treasury, and
that gives me two bosses, the Director of Research Analysis and Statistics of
IRS and the Director of Tax Analysis of Treasury, and they can – they sometimes
agree – put it that way – on what we should be doing and how we should be doing
things. So that is a little awkward.
Also, because I am Director of SOI, I get occasional questions as to, Can
you tell me about Section 8267 of the Code and how that affects my small oil
and gas operation?
And the bottom line is the Internal Revenue Code is 2,000 pages long. The
supporting regs are over 10,000 pages, at last count, and I really don’t have
that encyclopedic knowledge. I am sorry.
But I do get questions like, What is the average amount of charitable
contributions for my level of income? And that is something I can look up, even
though I can’t look up what is the status of your refund.
So, that said, hopefully, I can tell you a little bit about IRS tax data
administrative statistics as a potential source for shedding some light on
health outcomes and so on.
Before I say anything further, I would like to add my disclaimer that these
are my personal views and not necessarily those of the Internal Revenue Service
or the Department of Treasury.
A little bit about my organization. IRS is a large organization, 100,000
employees, an $11-billion budget.
The SOI program is about .4 percent of that, $40 million, which sounds like
a lot of money, but, relatively speaking, it is small, and under 500 employees.
Our primary customers – We have two very intensive customers and those
being the Office of Tax Analysis of the Treasury Department and the
Congressional Joint Committee on Taxation.
I’ll be talking a little bit about data access and disclosure, and let me
just say that those organizations do have full access to all of our data, and
they have a heavy role in directing our priorities and studies.
However, we do have many other customers, many in the federal statistics
community, including the Bureau of Economic Analysis and the Census Bureau.
Okay. What kind of data do we have?
Well, I probably should have given a little bit more background to this
slide. I am going to be talking a little bit about two types of data. One are
the sample data from my organization. We sample tax returns and we have
scientifically-designed samples. We edit these samples very carefully, and we
weight these to national totals.
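The speaker does not say how the weights are built, so the following is only a minimal sketch of one common approach, a stratified expansion estimator, in which each sampled return stands in for population/sample-size returns in its stratum. The stratum names, counts and amounts are made up for illustration and are not SOI figures.

```python
# Hypothetical strata: population counts and sampled amounts are invented.
strata = {
    "low_income": {"population": 1000, "sampled_amounts": [10, 20]},
    "high_income": {"population": 10, "sampled_amounts": [500, 700, 900, 300, 600]},
}


def weighted_total(strata):
    """Estimate a national total by weighting each stratum's sample sum."""
    total = 0.0
    for stratum in strata.values():
        # Each sampled return represents this many returns in the population.
        weight = stratum["population"] / len(stratum["sampled_amounts"])
        total += weight * sum(stratum["sampled_amounts"])
    return total


estimate = weighted_total(strata)
```

In this toy setup the high-income stratum is sampled at a much higher rate (weight 2) than the low-income stratum (weight 500), which is the usual reason a sample design of this kind needs weights at all.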
The content is based on our user needs, and so if the Treasury Department
comes back and says, We want multiple schedules on depreciation by different
types of class lives, we can add that. It is a resource issue, but it is not
a policy issue or program issue.
Now, separate from that is the relatively content poor and less edited data
in the IRS Master File system, and we don’t produce that data, but we are kind
of a gateway to other federal statistical agencies and researchers who do have
access to that.
It is also the main source of statistics at the sub-national level, because
our SOI samples are not robust for most states and certainly not below the
Okay. So those are the types of data we have.
Now, in each of these program areas – individual, corporate, partnership,
estate and gift tax, tax exempt or non-profit organizations – we have pretty
much these two sources, and, for the most part, the content-rich SOI samples
are the preferred source, just because the data are of higher quality as well
as the content much greater.
And just as a footnote, I should add that all the data that is filed on all
tax returns and schedules is not transcribed, and that is one reason for our
program is that I think there are something like nearly 100 schedules that an
individual can append to their 1040 return, and only a limited amount of that
data is transcribed. So, as far as content, we have that flexibility, although,
as I said, it is a resource issue, and, again, all of our data are preaudit.
Well, two viewpoints on SOI. Well, one is that we try to be a cooperative,
collaborative and efficient producer and user of data based on administrative
records, and I think we do a pretty good job on that, for the most part, but,
then, on the other hand, we are the – quote – tax collectors in disguise, as
survey statisticians, and I’ll talk about what that means a little bit further
down the road, but, first and foremost, we are employees of the Internal
Revenue Service, and that has certain legal issues in terms of what we do and
in terms of our relationships with others as well.
Now, we have kind of generically three opportunities for data linkages,
first being linkages involving solely IRS tax and information returns. Others
are linkages involving tax data to surveyed records, and that is a very short
topic, which I’ll tell you why, and, then, lastly, I suspected that the kind of
tone or focus of the conference would be what about microdata access by
researchers and other agencies and so on, and so I thought I would spend some
time on that, and I have brought my disclosure expert, Nick Greenia, who is
sitting in the back there, who works with me and my boss, Mark Mazur, on a
lot of interagency issues involving data access.
Okay. If you gave me the time, I could go on and on about this first one,
linkages involving tax and information returns, for which we have control over
the data. We have access to the data and so on, and first might be individual
returns linking in 1099s and W2s.
For instance, you see an individual return. It’s got a certain income,
$75,000. It is a joint return. There’s a husband and wife, apparently. Is it
from one income or two incomes? Well, you can’t determine that without looking
at such things as the W2s and so on. So we do linkages like that to look at the
whole picture of family economic income.
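The one-income-or-two question can be sketched as follows. This is an illustration under stated assumptions, not the SOI procedure: the record layout, the field names (`primary_id`, `secondary_id`, `person_id`) and the amounts are hypothetical, and a real link would use the Social Security number as the key.

```python
from collections import defaultdict

# Hypothetical joint returns and W-2 records; identifiers are placeholders.
returns = [
    {"return_id": "R1", "primary_id": "A", "secondary_id": "B", "wages": 75000},
    {"return_id": "R2", "primary_id": "C", "secondary_id": "D", "wages": 75000},
]
w2s = [
    {"person_id": "A", "wages": 45000},
    {"person_id": "B", "wages": 30000},
    {"person_id": "C", "wages": 50000},
    {"person_id": "C", "wages": 25000},  # second job, same person
]


def wages_by_person(w2_records):
    """Sum W-2 wages per person so multiple jobs count as one earner."""
    totals = defaultdict(int)
    for rec in w2_records:
        totals[rec["person_id"]] += rec["wages"]
    return totals


def count_earners(tax_return, totals):
    """Count how many filers on the return have any linked W-2 wages."""
    spouses = (tax_return["primary_id"], tax_return["secondary_id"])
    return sum(1 for person in spouses if totals.get(person, 0) > 0)


totals = wages_by_person(w2s)
earners = {r["return_id"]: count_earners(r, totals) for r in returns}
```

Both returns report the same $75,000, but only the linked W-2s show that the first comes from two earners and the second from one person holding two jobs.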
Partnerships. Partnerships are a major type of business, but they are
untaxed. People or organizations, corporations, non-profits form partnerships.
They report their financial activities on an information return, and then they
distribute their taxable shares to the partners, and those could be expenses.
They could be income and so on.
So to ascertain the total effect of partnership taxation, you have to link
in partner tax returns, and one partnership could have 20,000 partners. So it
is not a trivial matter.
Small business corporations are similar. They have the flow-through nature
of a partnership.
The estate tax has become a hot political topic once again, and among the
questions are, What happens to these bequests from large estates? The estate
return shows who gets them, but it doesn’t show the income and what happens to
those individuals over time. We can link them up by matching their 1040 returns
and following them over time on a panel basis.
There are a lot of issues in regard to consolidated corporations.
Subsidiaries may file separately. To get a combined picture of corporations,
you have to link these together, and we use the employer identification number
to do that.
And then another focus of our primary customers, Treasury and the Joint
Committee, are for panel files linking the same entity – corporation,
individual, et cetera – over time or linking within a year, and we do quite a
bit of that kind of work, which we can talk about as time allows.
Okay. What do these files have?
Well, first of all, high-quality linking variable. If you don’t have that –
We don’t do a lot of research on using other things like names and addresses to
link. If we don’t have a high-quality variable, we probably don’t have the
resources to do a high-quality study.
Fortunately, most of our studies, most of our files, the variable, the
employer identification number or Social Security number is very accurate.
Obviously, if you didn’t have overlapping samples and you try to link
records, you are going to get few hits because of that. So we need at least
one population file or samples that are substantially overlapping.
And, lastly, what about accounting periods? We do – at a lot of our linked
studies, we want to align accounting periods. We want to take the partner and
the partnership or the corporation and its employment return and we want to be
very specific about aligning those accounting periods, or, in other cases, we
want to show a panel-like study with different periods.
Okay. Well what have we learned from all these studies over time?
Well, a few things. First of all – and I think that has been the tone of
day one of the conference is that matched files can be very rich analytically
and so on, but, on the other hand, linking data files is never probably as easy
as it seems to be. Data quality is never optimal. Even in our SOI files, even
though the linking variables, our Social Security numbers and EINs, are high
quality, there are times when they are not as high quality as we would like,
and we have non-matching and so on.
Resolving these discrepancies, if they are important – and I think they are
– often is a very labor-intensive effort, and so we try to avoid that to the
extent possible, but we don’t want to produce a file that is based solely on
matches and ignore the non-matches. I think that would be a mistake, and linked
files sometimes don’t answer all the questions, and we could talk about that
But I think the one key thing is that you take this data, if you develop
high-quality linking variables and you put it in a relational database, I think
you are way ahead of the game. Okay?
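As a rough illustration of that point, here is a small sketch using an in-memory SQLite database. The table and column names and the data are assumptions made for the example; nothing here reflects an actual IRS schema. A LEFT JOIN on the identifier keeps the non-matching records visible instead of silently dropping them.

```python
import sqlite3

# Hypothetical tables: partnerships keyed by EIN, and partner-level records.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE partnerships (ein TEXT PRIMARY KEY, net_income INTEGER);
    CREATE TABLE partners (partner_id TEXT, ein TEXT, share INTEGER);
    INSERT INTO partnerships VALUES ('P1', 1000), ('P2', 500);
    INSERT INTO partners VALUES ('X', 'P1', 600), ('Y', 'P1', 400), ('Z', 'P9', 250);
""")

# LEFT JOIN keeps every partner record; 'matched' is 1 when the EIN was found.
rows = conn.execute("""
    SELECT pr.partner_id, pr.ein, ps.ein IS NOT NULL AS matched
    FROM partners AS pr
    LEFT JOIN partnerships AS ps ON ps.ein = pr.ein
    ORDER BY pr.partner_id
""").fetchall()

match_rate = sum(matched for _, _, matched in rows) / len(rows)
unmatched = [partner_id for partner_id, _, matched in rows if not matched]
conn.close()
```

Reporting the match rate and listing the unmatched records alongside the linked file is one simple way to avoid producing a file "based solely on matches".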
Well, this, I said, was going to be a short topic, linkages of tax data to
survey records, and it is short for two reasons. One, we do very few surveys
ourselves. The only surveys we do are for corporate returns, particularly
As you know, just as individuals can request filing extensions,
corporations can, too, and most of them routinely do, and so if we have an
early data cutoff and we need to get that corporate return in, that major U.S.
corporation, that sample at a weight of one, we often do a survey and request
preliminary data from them, and so – or, in some cases with multinational
corporations, they’re just not completed that accurately, the presumption being
that everybody provides IRS accurate and complete data, but that is not always
the case, and these multinational returns are very, very complex and sometimes,
even despite their best efforts, they are not complete showing all their
international activities and so on.
But then the last point is we don’t do matches to other survey data,
because, again, getting back to my point earlier, we are the tax guys, so
people don’t provide us microdata. I mean, that is the other hat we wear. We
are the IRS guys and the perception would be that could these data be brought
into a compliance type situation.
Okay. Provision of microdata to other agencies. Who can get what? Well, this
is a very, very short summary and so on, but, first of all, tax administration.
By this I mean, for the most part, members of the IRS, and sometimes Treasury
as well and so on, who have a tax administration motivation. Taxpayer account
processing, audit, compliance and research functions are all internal and they
all can get access to most of these. Although, we have to be very careful in
cases where there are sample data involved and could it destroy the
accuracy and representativeness of these samples.
Tax analysis. Treasury’s Office of Tax Analysis, the Congressional Joint
Committee, CBO and GAO all have roles in tax analysis and can get data, not 100
percent, but, for the most part, can get some identifiable tax data for
specific purposes as articulated in the Internal Revenue Code.
And, then, statistical use. The Bureau of Economic Analysis can get
corporate data. Census gets population data for individuals and for businesses.
When the Ag census was moved to the Department of Agriculture, NASS, a few
years ago, that data access was also enabled with them, and the CBO as well.
We talked yesterday – and I think the Census people presented this very
well – that their goal, their mandate is to use existing data systems to
the maximum extent possible, such as administrative records.
Our mandate, unfortunately, is the opposite, though. It is to provide only the
federal tax information for authorized purposes and to the minimum extent
necessary. So this is just a naturally conflicting mandate that we have.
Constraints on using IRS tax data. Well, first of all, it has to be for a
use for authorized purposes, and this is defined in the Internal Revenue Code
and in supporting regulations, the 10,000 or 12,000 pages that I mentioned
earlier, and can be also further defined in separate documents, MOUs, such as
the Census-IRS criteria agreement.
Again, our mandate is to disclose only the minimum confidential federal tax
information necessary, and there are substantial penalties for unauthorized
disclosure or inspection, and publicly released data must be anonymous.
Although, we do have a public-use file. As time allows, we can talk about that,
but we do remove identifiers and sanitize records in a subsample of our
individual program, and it is used by a number of high-profile policy analyst
Okay. Briefly, the authorization process for access to tax data.
Well, first of all statistical recipients – and this came up yesterday –
need to be cited in Section 6103(j) of Title 6 (sic) of the Internal Revenue
Code.
To change that, Congress must enact legislation.
The statute authorizes access purpose and may stipulate supporting
regulations and so on. The regs – regulation detail may restrict uses as well.
And, as I said before, policy agreements can provide additional
In summary, access to tax data is very restricted. Some possibilities
include – and these are very limited – working as a contractor for tax
administration purposes. We have had a few of these, but they are very limited.
Working at an agency with current access, like the Treasury or like the CBO or
the Joint Committee, or accessing limited business data via Census’ Center for
And to find out more, we can talk later or drop me an email or give me a
MR. LOCALIO: Thank you, Tom.
Do we have any questions?
MS. TUREK: (Off mike).
MR. PETSKA: Survey of Consumer Finances.
MS. TUREK: Yes. They use you as a sampling frame. They don’t get any data
MR. PETSKA: I worked on that study several years ago, and where it is right
now in terms of the firewalls, I am not clear exactly, but, basically, the
high-wealth portion of their study is a list frame developed from our 1040s.
That is correct, and we have done this with all years of the Survey of Consumer
Finances, going back at least to ‘83, and we did have some involvement in
the first, the Survey of Financial Characteristics of Consumers.
Nick, do you want to say something about that?
MR. GREENIA: It is true that the Federal Reserve does receive some data,
but the Federal Reserve is perceived as a 6103(n) contractor, which means that
the purpose of that tax-data receipt is seen as fulfilling a Treasury
As you may recall, when CIPSEA was submitted to Congress in 2002, there was
a companion bill, and the companion bill had the Federal Reserve in there, so
that they could receive tax data unrestricted for survey of consumer finance
DR. STEUERLE: Tom, you might clarify a little the fact that it is possible
to apply to your agency to have data run by outsiders.
I am wondering if you might also comment the extent to which that is really
restricted by resource constraints, because I am sure everyone who comes to you
to have something run basically impinges upon your resources to some extent,
because they always need some helping hand, but that is a possibility –
MR. PETSKA: Yes, that is a very good point.
DR. STEUERLE: – with respect to health data that may not be even with
respect to any tax question, right?
MR. PETSKA: Yes, that is right. I mean, again, I have talked about
restrictions on access to micro data, but at table-level data, we have
disclosure rules to suppress cells that have fewer than three observations at
the national level and so on.
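The threshold rule mentioned here can be sketched in a few lines. This is a minimal illustration, assuming a table stored as a mapping from cell label to (count, amount); the cell labels and figures are invented, and a real disclosure review would normally also need complementary suppression so that masked cells cannot be recovered from row or column totals.

```python
MIN_CELL_COUNT = 3  # cells with fewer observations than this are suppressed


def suppress_small_cells(table, min_count=MIN_CELL_COUNT):
    """Return a copy of the table with under-populated cells masked as None."""
    published = {}
    for cell, (count, amount) in table.items():
        published[cell] = (count, amount) if count >= min_count else None
    return published


# Hypothetical tabulation: (number of returns, total amount) per income class.
table = {
    "under_50k": (120, 3400000),
    "50k_to_100k": (2, 90000),
    "over_100k": (3, 750000),
}
published = suppress_small_cells(table)
```

Here the middle cell is masked because it has only two observations, while the cell with exactly three is kept, since the rule as stated applies to cells with fewer than three.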
But, for the most part, if we have an existing file that will meet your
needs, we can enter into a small reimbursable contract to produce tables from
those files and so on.
The problem gets in when there’s matching required or content that we do
not currently have.
For instance, a few years ago, we talked about the idea of non-cash
charitable contributions. Could we produce some aggregate statistics to be
published on that?
We didn’t pick it up in the program back in those days, and so for us to
edit those data, build it into the sample, weight it and everything else was a
very expensive task.
Since then, we’ve gotten a push from Treasury and from Joint Committee to
include that part of the program. So, now, we have, though, so we could produce
additional – from that and so on.
So, again, we do have restrictions on staff, time and so on, but, for the
most part, a tabulation from an existing file, we really try to service those
kind of requests.
DR. DAVERN: Hi. Michael Davern from the University of Minnesota. I have a
We heard from Census yesterday about matching Medicaid records to the
current population survey. Something that might be interesting as well would be
to take a look at some of the W2 information – I don’t know if they have access
to it – which would say deferred compensation for health insurance coverage,
for example, that a person paid pre-tax dollars into an account.
I was wondering if that was some kind of matching study that may be
possible to verify how well the CPS not only measures Medicaid, but how well it
measures private insurance coverage.
MR. PETSKA: That is a good question.
Nick, can you help me out there in terms of does Census get the W2 records
MR. GREENIA: As a result of a regulation amendment two or three years ago,
they now get some limited data that includes deferred compensation from the W2.
So that information is in scope at the Census Bureau.
Where we get into some issues is when Social Security data might be
involved, and Social Security, as you know, has a very unique arrangement with
the Census Bureau, they essentially access tax data and Census data as special
sworn status employees.
So, for purposes of the criteria agreement, the policy agreement that Tom
was talking about that enables access to new projects by special sworn status,
we treat them as, if you will, employees for purposes of this agreement.
The sticking point is that that kind of access, especially if we are
talking about any tax data that Social Security does not have access to, even
when they are matched to Census Bureau and they do have access as far as we are
concerned, Census Bureau views them as special sworn status employees, which
means, for purposes of our agreement, they are viewed as Census employees.
So once that enters the equation, the work has to be done under the
criteria agreement, which means that it has to be predominantly for Title 13,
MR. LOCALIO: Howard, did you want to comment on that at all?
MR. IAMS: Yes, the process for doing that would be to go to the Census
Bureau and apply for permission to use these data at your Census restricted
data center and they have an application process that formally requests a
purpose, and what Nick has emphasized about the Title 13 is – has to be done,
what you have articulated would have a clear Title 13 purpose.
I think the problem you will encounter is that I do not think the code that
isolates health purpose for the deferred compensation is available. It may be
available for the last earnings year, but it was not available before 2005. So
you won’t be able to identify how the purpose of this deferred differs from,
I don’t know if Ron – do you remember? We recently started coding for our
matched data the reason for the deferred – what kind of account it was –
whether it was 401(k), 403(b), 457, and I do not recall if the health account was
MR. PREVOST: Yes, I don’t believe it was, but we could check into that,
that is for sure. Certainly, the project that we are potentially discussing
here would have a clear Title 13 benefit. I mean, it is –
MR. IAMS: Oh, it would. The question is whether the data are there.
MR. PREVOST: Whether the data are available is the main question.
MR. IAMS: The deferred compensation is identified, but there is a box that
identifies what the deferred compensation reflects, and I don’t recall if
health account was one of the codes. It has only been available in the last
MR. LOCALIO: Well, I think we need to go on.
I want to thank you, Tom, very much for your presentation. We gotta go on
or we are going to run out of time. I am sorry.
Next, we are going to hear from Richard Bjorklund from the Veterans
Administration. While he is getting his presentation set up, I just want to say
that, yesterday, we heard several people comment about their potential dealings
with the Veterans Administration and how those dealings were cut short by that
announcement of a potential breach which never happened.
I do want to say that I got a letter from the Veterans Administration, as a
veteran, saying that there had been a potential breach, and then a letter
saying that it did not happen, and I think we would all be interested to find
out what have been the – if you have any comments about what have been the
repercussions of that from your perspective.
MR. BJORKLUND: Well, just to answer that question quickly, the
repercussions have been a significant tightening of all of the policies and procedures
for distributing data both within the organization and also to other federal
agencies or contractors or researchers.
In fact, researchers outside of VA who once had access with stipulations to
VA data have all but been restricted from accessing that data now.
So the security and privacy procedures have been tightened extraordinarily,
and I know some of you from CMS are here, and we have been working diligently
for months to get specific agreements in place and protocols for transferring
VHA data to CMS. That is happening shortly, but it has taken a great deal of time,
Well, let me launch into the presentation. My presentation is more general in
nature than what Tom was talking about.
I am going to be talking about more strategic direction that our
organization is taking regarding linked data and talking specifically about a
project that is underway.
First, we link with our internal data a number of independent activities.
We have an annual survey of enrollees where we try to identify perceptions,
interests, preferences, behaviors of veterans, and we link that to their
inpatient and outpatient clinical records.
We also do customer-satisfaction surveys of fairly excruciating detail, and
we link that also to our administrative data both clinical and cost data. So we
can get a comprehensive assessment of the performance of veterans
Today, I want to talk specifically about a very large project that we have
been undertaking for the last 18 months, and it regards the integration of VHA,
Medicare and Medicaid data in the production of a user-friendly system, and I
am going to be talking about the opportunities that we envision for improving
healthcare outcomes, some of the barriers to implementation that we have
observed going through this 18-month process, more about what the process was
and some of the challenges going forward.
First, to put this in context, I want to spend a little bit of time talking
about the VHA organization.
First of all, VHA is a component of the Veterans Administration. It is one
of three major components. The other two being cemeteries and veterans
We have approximately 156 hospitals, 876 outpatient clinics, nursing homes
The VHA budget is about $35 billion and would rank it amongst the Fortune
50 organizations if we were a private-sector organization. So we are a very,
very large organization and we are a very big player in the U.S. healthcare
Recently, VHA has been mentioned as providing some of the best healthcare
in the country, and it has been management’s objective to be the world-class
healthcare provider for some time, but to continue to be the world-class
provider, we need to continue to be on top of our game, and that is to identify
opportunities to improve quality, cost and access, and, in addition, we think,
by identifying these opportunities, it facilitates what we refer to as a
When these opportunities are identified, it challenges our employees to
think about the best solution to the opportunities, and, hence, old cliches
like not invented here are quickly becoming disassociated from the culture of
the organization and in its place is a constant search for better ways to do
things, better ways to achieve superior outcomes and looking beyond our
organization to the outside world to identify ways to improve that and
essentially becoming globally smart, and we think that integrated data provides
opportunities to facilitate our overall objectives of maintaining our
world-class status and identifying opportunities, and, specifically, the areas
that we think have the greatest potential here are for best practices, and that
is comparing VHA with the private sector along both outcomes and cost
So for any of our 156 hospitals, we would be able to identify where the
biggest opportunity, the biggest clinical opportunity for improvement is or
where the biggest cost opportunity is and what the tradeoffs and the metrics
linking cost and quality are.
So we, essentially, are beginning to use – we have internal resources
devoted to developing risk-adjusted outcomes models, severity-adjusted cost
models. We also have resources dedicated to identifying how veterans make
decisions when they select a VA facility versus a private-sector facility, and
we can compare things like quality, cost, access, benefits and service
characteristics of our facilities versus those in the private sector and look
at the impact of decision making.
As part of this particular effort, we were also able to identify fraudulent
billing practices that occurred where healthcare plans, physicians, offices, et
cetera were billing, double billing both VHA and CMS for the same set of
We think that when more timely data are available or integrated into our
plans that physicians will be able to utilize this data online for treating
And, finally, strategic opportunity identification where we can look at –
from the corporate level, we can identify where we think the biggest
opportunities are, whether they be in cost or quality of access, and identify
corporate-level strategies that would be part of the corporate-level strategic
plan for the coming year.
In terms of barriers to implementation, we have talked about the number of
opportunities, I think, by the description that I have given. The size of these
opportunities is potentially huge. So why haven’t we taken advantage of these
opportunities in the past?
These are some of the barriers that we have identified as we went through
this project. First, there were few people that have knowledge about these
three data sets and the ability to access the data.
Secondly, integration is very difficult and time consuming. Medicare and
Medicaid and VHA data are three separate databases that were developed
independently with different purposes, different data and different data
definitions, and so if you can imagine every time in the past when we have
tried to do an analysis of data where we had to integrate this, it was each
time the data had to be integrated for that single purpose.
Data sets are very large and generally require a higher level of
programming skill, and, generally, that is SAS.
Investment in hardware and storage media can be an important consideration,
depending on the number of users.
Potential users of this data have different needs, and it is – those needs
have to be carefully considered in designing a system. One system will not
satisfy the needs of all users.
The size of the potential demand for this integrated data is unknown within
our organization, and, hence, there is a risk in investing in a large system
that is very expensive and may not produce a payback.
And, of course, privacy and security laws and regulations add in a very
large dimension to managing this set of data and is something that is becoming
increasingly more important, being elevated in terms of its priority in making
Next, the decision makers, by and large – and I guess I am talking in
general – do not have the experience of using data that is outside of the
organization. Historically – and I think this is true within most organizations
– the focus has been on internal data, and, quite frankly, I would suspect that
– just estimating – 70 percent of the information for solving most problems
comes from internal information.
And so our managers and executives are not – do not have the extensive
experience in requesting the kinds of information that comes from the
And, finally, and another important consideration, is the economics of such
a database, and when we talk about the economics, we are thinking more broadly,
not specifically, at the costs, the dollars and cents numbers, but more broadly
in terms of the fixed cost and the variable costs of these activities and
whether it makes any sense to outsource those variable costs, those costs that
could be converted from fixed to variable.
So the process in this project, first, we undertook a survey of users in
our organization to try and identify both the size of demand and the timing of
demand and also customer uses of integrated data, and those are potential uses,
and we learned that demand would grow slowly, but would, over time, begin to
increase rather dramatically as people learned how to use the data and how to
access the data.
We hired a contractor who had experience in integrating Medicare and
Medicaid data, but no experience with VHA data, and asked them to integrate the
data and develop a user-friendly system.
The user-friendly system we considered key, because it would expand,
exponentially, the number of users, and, specifically, historically, our user
base for this integrated data have been researchers and data analysts.
We wanted to expand that to what we refer to as the casual users, the
directors of hospitals, the directors of our VISNs or regions, our chief
medical officers, et cetera, but they are not sophisticated users, and so the
user-friendly system had to be simple enough for them to access the data, and
we thought, as we expanded the user base, that we would be increasing the value
that the organization received from the data.
The pilot test that we put together consisted of data integration of the
three data sets that I have spoken about and systems design.
We had three white papers written which were basically analytical, short
analytical papers by the contractor. He worked on three issues that were top
priority issues for the organization at the time and presented some white
We also did tutorials to researchers and data analysts about the system. We
asked them to come up with a short research topic and to use this integrated
system to address the quick research questions that they came up with.
After the tutorials and research projects were completed, we conducted a
customer-satisfaction survey amongst the users to identify strengths and
weaknesses of the system and of the integrated data. We also did a data
validation study, where we validated the data that came from this integrated
system with the raw CMS data that we have in our files.
Generally, systems design, we have in mind a multiphased project. Each
phase would consist of design, use and assess. So we would be coming up with a
first phase having our users assess it, going back to the design table with the
contractor, redesigning it again and going out, having users use it and
assessing it, until we felt comfortable that the new system could be rolled
out to the entire organization, and I have talked about the three customer
groups, the researchers, data analysts and the casual users we were trying to
In terms of Phase One, we used the five-percent sample of Medicare data for
one year and 100 percent sample of VHA and Medicaid data for the same year and
merged those three data sets.
As I mentioned, we did a customer-satisfaction survey, and it pointed to
areas that were strengths of the system, but also some shortcomings in the
system. So we are prepared to make improvements should we decide to go forward.
VHA, Medicare, Medicaid data were integrated using contractor assumptions;
that is to say that VHA staff were not intimately involved at this point.
Intermediate data products, and these were basically SAS data sets that
were produced from the raw data, were compared to VHA Medicare files and no
significant differences were found.
Issues that came from the satisfaction survey, one was spending more time
learning the system and/or making the system more user friendly were raised.
There were technical questions raised about how to make the system faster, and,
for example, with bigger machines, more memory, processing one year’s worth of
data at a time.
It was thought that managing risk associated with HIPAA, the Privacy Act
and security regulations might be reduced via using a contractor and
contractor’s customized software, and, in the future, more involvement of VHA
staff was needed.
Questions remain about whether there is sufficient demand to justify the
investment, technical and other challenges – whether technical and other
challenges can be overcome.
And some of those challenges are the user-friendly nature of the system.
Can it be made more intuitive. Reducing the learning curve of researchers, data
analysts and clearly for casual users.
We think that a reporting system linked to the output of the user-friendly
system would satisfy the needs, for the most part, of casual users.
And addressing processing time issues is also another one. Processing time
was mentioned by our technical folks. Another side of the story was mentioned
by one of our researchers who said that his time from the point where the
project was initiated and where integration of the data was called for to the
time that he received analytical results was cut almost by a fifth.
Now, at the same time, what we were hearing from some of our technical
folks was that it was taking up to 24 hours to run requests for large data sets.
So there is some benchmarking that is required as we go forward and some
internal agreement as to what we are going to measure and how we are going to
MR. LOCALIO: Richard, we have to move on to the next speaker, if you could
conclude as quickly as possible.
MR. BJORKLUND: Okay. There are cultural differences issues that we need to
overcome. I have mentioned the economics and the importance of outsourcing,
and, finally, some organizational issues.
This has been a green-house project. We are not entirely sure whether it is
part of a – the next phase should be part of a planning and policy office.
MR. LOCALIO: Thank you very much.
I want to introduce our next speaker, and, then, while you are setting up,
maybe entertain a question.
Our next speaker is going to be Christopher Chapman of the National Center
for Education Statistics.
Do we have a quick question for Richard on his presentation?
DR. SCANLON: This is maybe a comment as much as a question.
Since you raised sort of the issue of Medicaid data again, we heard about
it before, it creates for me sort of a bigger issue, which is the quality of
administrative data, and while we are interested in terms of linkages to be
able to expand our capacity, there is a question of did we move too far in
terms of reliance upon sort of administrative data, and while averages may sort
of turn out for populations to be the same, when we do validation studies, when
we get down and we start to slice things more and more, we may be on relatively
thin ice, because the data are not good.
I raise this because of sort of prior work that I did at GAO. Medicaid data
is always suspect, and there were efforts that we had to do where we had to go
out and collect new data because the kind of information that comes to CMS is
not necessarily sort of accurate.
It becomes even more problematic as Medicaid moves more and more towards
use of managed care and the variability, in terms of what the managed-care
plans report to the state, is increasing and leaves you big gaps, and so I
guess this is – maybe there’s not an easy answer to this, but I think it is an
issue that we should be thinking about.
It is as much as sort of how much – in terms of trying to protect sort of
privacy, whether people can be identified, it should be a concern about sort of
linked-data sets should be – what is the extent of their strength? I mean, what
can they be used for and what would be pushing their limits too far in terms of
reliability and accuracy?
MR. LOCALIO: Did you want to respond quickly?
MR. BJORKLUND: Yes, in terms of the Medicaid data, yes, we agree. We have
concerns about that.
Our primary focus in our best practices is with the Medicare and the VHA
At the corporate level, we try to what we say – what we refer to as dumb
down the data. So we downgrade it from ratio and interval-scaled data to
nominal and ordinal data, so as to eliminate some of that error, but, clearly,
these are concerns.
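[Illustrative sketch, not part of the testimony: one way to read "downgrading" ratio-scaled data to ordinal data is to bin a numeric measure into ordered categories, so that small errors in the underlying value matter less. The cut points, labels, and visit counts below are hypothetical; the speaker did not say which categories VHA uses.]

```python
from bisect import bisect_right

# Hypothetical cut points and labels; not taken from the testimony.
CUT_POINTS = [1, 4, 10]
LABELS = ["none", "low", "medium", "high"]


def to_ordinal(value, cut_points=CUT_POINTS, labels=LABELS):
    """Map a ratio-scaled value (e.g. a visit count) to an ordered category."""
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the category the value falls in.
    return labels[bisect_right(cut_points, value)]


if __name__ == "__main__":
    visits = [0, 1, 3, 4, 12]  # made-up annual visit counts
    for v in visits:
        print(v, "->", to_ordinal(v))
```

The trade-off in a sketch like this is that a value recorded as 3 instead of 2 lands in the same category, at the cost of losing detail within each band.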
MR. LOCALIO: Thank you.
Chris, why don’t you proceed. Thank you.
MR. CHAPMAN: Sure.
Hi. My name is Chris Chapman. I am from the National Center for Education
Statistics, which is part of the U.S. Department of Education.
My presentation is really going to focus more sort of on our experiences at
the center with using administrative record data that have been collected
already through the National Center for Health Statistics.
I guess before I get too far into this I should also make a disclaimer. I
am speaking here not for the department or from organization, but more as a
That said, let me sort of jump in and discuss a little bit about the kinds
of data that we typically get at the center regarding health.
Not too surprisingly, we focus most of our data collection on trying to get
information about students and other individuals like teachers that are key to
the education system, and apart from individual-level data, we also collect
information directly from institutions themselves, in particular schools.
Most of our experience there, in terms of gathering health information, has
been at the elementary and secondary level, trying to determine which students
actually have individualized education programs which are specifically designed
to help students with disabilities get the kind of education that they are
going to need in order to function in society later on after school.
The data sources that we normally get our information from are parents,
students and school records. Okay. These are not health-system type data
collections. They are relatively general, and we rely on parents to have
relatively good information about medical evaluations of their children, and we
rely on students to be relatively knowledgeable about their health conditions,
and for school records, as I mentioned before, we really focus in on the IEP
data that schools gather and keep for their students.
However, it would be good and would be useful for us to be able to get more
information about student health linked into the school record systems.
This next couple of slides, I am going to briefly go over the types of data
collections that we’ve got in place, so that you’ll have a better understanding
of what we have available and what we usually work with.
This first slide focuses on the early-childhood longitudinal studies. These
studies are actually in my program office.
As you can see here, we’ve got two cohorts. There is a birth cohort and a
kindergarten cohort. The birth cohort focuses on a group of children who were
born in 2001 and the kindergarten cohort focuses on a cohort of students who
were in kindergarten during the 1998-1999 school year.
The reason why I want to start off with this slide is the ECLS-B, the
birth-cohort study, really is, I think, the center’s most extensive experience
using health-record systems – okay? – in particular the sample for the study
was drawn directly from the birth-certificate record systems that are available
through the National Center for Health Statistics.
And apart from using the birth-certificate data, that data set also
involves some direct health assessments. Our field interviewers actually did
data collections on birth weight, height, cognitive growth and motor-skill
development as the child progressed from birth through at least into
And then we also had, as I mentioned before, many of our data sets, parent
reports of diagnosed disabilities and overall health of the child.
The kindergarten cohort data collection had much of the same kind of health
information, except for the birth-certificate data, and those data would have
been useful to get.
We were not as experienced, I don’t think, as some of the other
organizations here with actually taking an existing data set and trying to
cross link it with administrative record systems. So we did not undertake that.
And apart from that information, the kindergarten cohort also collected
data directly from schools about IEPs for the sample children.
This next slide has some information about a high-school cohort that is
comparable to the ECLS studies in that we are tracking a group of students over
time. In this case, it is tenth graders, and we are tracking them through early
Again, we – the health-related information we have in this collection were
reports from the students’ schools about their IEP status and health-related
programs that the students were in.
We have also asked the parents to provide us information about diagnosed
disabilities that might not have impacted their IEPs, and then we also asked
the students themselves about their health status.
The National Household Education Survey collects data about populations
from preschool through adulthood. Here, our only data source is information
that we get directly from the parents and the students themselves.
The type of information that are on these data sets that would allow us to
cross link to the administrative record systems is relatively limited. The
sample draw for this particular data collection are telephone numbers, and we
have not, to date, found a good way to even cross link the telephone numbers
with strong, address-matching records, which prohibits our ability to link it
into some of the more detailed administrative record-data systems that are out
This next slide summarizes our post-secondary data collections that we have
collected to date and that we continue to field.
The biggest one here is the National Postsecondary Student Aid Study or
All these studies, again, rely on us getting reports directly from the
students about their health status.
The NPSAS collects a lot more detailed information than many of our other
studies, but, nonetheless, is still a self-reporting system.
Okay. I wanted to get back to the ECLS-B for a moment here, because, as I
mentioned before, that is really our primary experience with this sort of
activity, where we are trying to actually link in existing administrative
record systems with our survey-data systems.
The birth cohort did this quite efficiently by starting out with the
birth-certificate data system that is available. So, as a result, we have a
very rich database available for these children right from their birth, and the
data set is actually rich enough or the birth-certificate data is actually rich
enough that we treat that initial birth-certificate collection point as
actually an initial data set, even though we didn’t do any surveys. We just
basically took the data off of the birth-certificate data and we linked it into
the student record, and we have been tracking it ever since, but that is really
our only experience to date using the statistics with our survey data.
Staying with the ECLS-B, I don’t want to minimize the health-record systems
that are out there. Without them we could not have even done this study. We
wanted to make sure that we had a representative sample of children at birth,
and the most efficient way to do that was to use the birth-certificate-record
In the next few slides, I am going to go through some of the ramifications
of that, one of which is that because the administrative record data are so
rich and it is relatively easy to identify individuals using them, even in a
relatively small-N sample study, with the ECLS-B, in particular, we have gone
to a model where we do not have a public-use data set. Okay?
If researchers in the room are interested in using the data, they need to
apply to the center for a license, and, then, we’ll grant you the license, and we
have a relatively stringent – procedure to make sure that the data are not
inadvertently released and that no individually-identifiable information is
That said, much like the VA, we have done a lot of work to try to figure
out ways to make the data more user friendly to the public. You know, you can
get a restricted-use license if you are a researcher and do a lot of very
interesting analyses, but a lot of the times the types of people who want to
get access to the data are school administrators or child-care providers, who
just want to get a general snapshot of what the population looks like. So we
have been working with an online data tool that will allow people to get access
to the underlying micro data, and we have done similar studies that other
agencies have done to make sure that those reports are accurate and also to
make sure that the types of data you can get out of the systems cannot drill
down below groups that are smaller than 50 in number. Okay? That is our primary
Yes, 50. I know. Some people think that’s pretty big. Some people think
that’s way too small. We have gone back and forth over the years. To date, we
haven’t been able to – it with 50. We haven’t really tried to drop it down any
further, but right now, that is where we are at.
In – reports, however, we will and do produce tables that have cells that
are based on Ns as low as three. If it gets below three, we go on
data-suppression mode on the cells and collapse them, so people can’t even
figure out that we only have three or fewer cases in a cell.
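[Illustrative sketch, not part of the testimony: the small-cell rule described above can be pictured as a tabulation step that collapses any category whose count falls under a minimum cell size. The function name, the "(collapsed)" label, and the default threshold of three are assumptions; the speaker's wording leaves open whether a cell of exactly three is published, so the threshold is a parameter. This is not NCES's actual procedure.]

```python
from collections import Counter


def tabulate(records, key, min_cell=3, other="(collapsed)"):
    """Count records by `key`, collapsing categories with fewer than min_cell cases."""
    counts = Counter(r[key] for r in records)

    # Categories that meet the threshold are published as-is.
    table = {cat: n for cat, n in counts.items() if n >= min_cell}

    # Everything else is pooled into one collapsed cell.
    small_total = sum(n for n in counts.values() if n < min_cell)
    if small_total:
        # Publish the pooled count only if it clears the threshold itself;
        # otherwise show the cell as suppressed (None).
        table[other] = small_total if small_total >= min_cell else None
    return table


if __name__ == "__main__":
    # Made-up records for illustration only.
    rows = [{"group": g} for g in "AAAAABBCC"]
    print(tabulate(rows, "group"))
```

A sketch like this does not handle complementary suppression: if a published total sits next to a single suppressed cell, the hidden count can be recovered by subtraction, so a real disclosure review would need more than this.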
At the center, most of our data collections are actually done through
sample surveys, and, as I mentioned, with like the in-house – the National
Household Education Survey, the sample frames themselves are often limited to
the extent that you have identifiers that you can easily use to link into
existing administrative records systems.
So, to some extent, some of these crosswalk activities might be of relative
utility to us, but I started to try to think through, well, how could we access
these rich data sets that are out there now and help them inform our studies,
and we have some experimental articles that have been put out that look at self
reports and crosswalk them with the administrative record data on like health
records or medical records, and those studies have been relatively useful, in
terms of us improving our survey items.
One way that we can use the administrative record system is to do
relatively small-N studies whereby we actually can sort of cross link and
purposefully design our study to cross link the survey data with the
administrative record data to see just how accurate self reports are and to try
to figure out ways to improve those self reports.
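[Illustrative sketch, not part of the testimony: a small validation study of this kind comes down to linking each survey response to the matching administrative record and measuring how often the two agree. The identifiers and the yes/no item below are hypothetical, and simple percent agreement is only one of several measures an analyst might choose.]

```python
def agreement_rate(survey, admin):
    """Compare self-reports with administrative records for linked cases.

    Both arguments are dicts mapping a (hypothetical) person ID to a value.
    Returns (share of linked cases that agree, number of linked cases).
    """
    # Only cases present in both sources can be compared.
    linked = survey.keys() & admin.keys()
    if not linked:
        return None, 0
    matches = sum(survey[i] == admin[i] for i in linked)
    return matches / len(linked), len(linked)


if __name__ == "__main__":
    # Made-up example: does the child have a diagnosed condition?
    self_report = {101: True, 102: False, 103: True}
    admin_record = {101: True, 102: True, 104: False}
    rate, n = agreement_rate(self_report, admin_record)
    print(f"{n} linked cases, agreement = {rate:.0%}")
```

Reporting the number of linked cases alongside the rate matters here, because unlinked cases (103 and 104 in the example) drop out silently and can bias the comparison.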
And another area of research that we probably should consider is taking a
look at linking the administrative record data that are out there on health
statistics with our own school-based administrative records systems.
We have relatively extensive administrative records systems that we have
for both elementary and secondary schools, and also our postsecondary schools,
and, right now, the type of data that we have really focus on disabilities, and
that has to do with some legal requirements that the department has to help
service students with disabilities, but thinking beyond that, we could, if the
data were available, use health statistics and health data to try to figure
out, well, are there students who are not necessarily of disabled status who
could benefit from additional services? And, right now, we don’t have any way
to really get those data, and I think we could use the administrative records
systems to do that.
I have a feeling that this first bullet was pretty extensively covered
yesterday, but one of the key issues that we have and would have with trying to
do some more crosswalks with records systems is actually getting the correct
identifiers in our surveys.
In order to do that, we have to have a really good understanding of what
information is available in the different records systems that are out there
that we might link into, so that we are not collecting data to crosswalk that
only will crosswalk with one record system and not another. We don’t want to
waste resources there.
And then we also need to have some help interpreting the data that we do
get from the health databases.
I just heard a little bit of a back and forth here about what exactly is in
the Medicaid data sets. We’d have to get a better understanding of the
strengths and weaknesses of the administrative records systems to use them
properly. We don’t have that kind of expertise in house right now. I mean, we
really focus on education types of issues.
I think that is it. So not too far over.
MR. LOCALIO: Well, thank you, Chris, and we have time for a couple of quick
questions before we have to take a break.
DR. STEINWACHS: It would help me to get a little bit better idea, on your
longitudinal studies you are picking up information on individual children.
MR. CHAPMAN: Um-hum.
DR. STEINWACHS: I guess linking in some of those information on the schools
and those resources.
MR. CHAPMAN: That is right.
DR. STEINWACHS: Are there other national data collections that the
Department of Education does that tracks children or are they all these – are
they special studies or is there sort of a statistical system that –
MR. CHAPMAN: There is an ongoing collection effort, basically, where we
start a new cohort every so many years of a particular population, so – Then
that ranges all the way from high school up to college.
DR. STEINWACHS: And those would be nationally representative and –
MR. CHAPMAN: They are nationally representative data. That’s right. So
drilling down below the national level really isn’t an option. The cost gets
prohibitive very quickly with these types of collections.
DR. STEINWACHS: Because, on the health side, there are some interesting
issues these days as people are very concerned about the amount of drugs and
medications being given to children, whether it is for Attention Deficit
Disorder or other problems, antidepressants. There are anti-psychotics now
being given to small kids, so on, and, in concept, you might think creatively
about or we might think creatively, I guess, about are there ways in which you
could take information like out of Medicaid or other sources that could tell
you geographically populations that are getting high rates of these and link
them to school districts or things that would tell you something, and I was
just wondering whether or not it was possible with the kind of national data,
and I guess maybe it’s probably a longer discussion –
MR. CHAPMAN: Right. The first answer is maybe. The second answer is we
start those – Especially with the studies that we do of children who are
already in elementary and secondary school system or who are already in
college, we start that sample design with basically school frames. So if there
is some way that we could link a student ID, especially once they are in
college, when we start getting Social Security numbers, then, the linking
process becomes relatively straightforward.
But for the younger children, as you were talking about, we might be able
to do some linking activities through the addresses that the schools would have
for the students and then crosswalk those into the databases that are out there
on health statistics. I mean, we do think about that stuff.
DR. STEINWACHS: Thank you.
MR. CHAPMAN: Yes.
DR. STEUERLE: This question is a question I am going to ask later, when we
get to our final session. So you might just answer briefly, if you have some
answer, but I am involved in so many projects within particular silos of
government organizations, but within the research community itself. So I am
involved with one group that is trying to study children’s outcomes that pretty
much is now focusing on early childhood education and even beyond, and has sort
of at least taken up, whether correctly or not, this model that the earlier we
intervene with children the more the return on the investment, and then I am
involved with groups like this, which is interested in healthcare, and then, at
Urban Institute, we have another group that is helping to work with some of
these longitudinal studies you are creating at Education, and sometimes I
wonder how much do the health, the economic and the education researchers
really get together when they design some of these samples.
I guess it is probably unfair to throw this all on you, except that it’s so
many cases where we are talking about outcomes and opportunities and mobility
and issues like that. People always seem to come back both to early
intervention and to education, and sometimes I don’t know how to link them.
Give you a common example. For instance, some people now think we really
should start at minus nine months, starting to measure what is happening to the
well being of a child, because it could be that drugs and alcohol, depression,
whatever other illnesses within pregnancy could have strong educational
outcomes down the road.
So the question for you is to the extent you get together, you start
designing these models, how easy is it to bring in somebody from HHS and how
easy is it for them to come, and how easy is it to bring in people from some of
these very different worlds and try to really design the models and the
longitudinal studies you have?
MR. CHAPMAN: Okay. I think at the beginning I made a disclaimer that I am
speaking for myself. I am going to do that again here. Not speaking for the
The ease that I have experienced so far has been great. It is relatively
straightforward to contact Health and Human Services or the National Center for
Health Statistics and say, We are developing this study of children from
birth, and we are going to track them through the first couple of grades or at
least through kindergarten. Can we start to have some meetings with staff in
your agency that might be interested in related topics?
And for those early childhood longitudinal studies, we have had a lot of
input from Health and Human Services, and, obviously, we had to get the birth
certificate data for the – cohort, but we have also done some work with – in
our own Office on Special Education – to get better measures and to think
through measures in health.
I think what we run into isn’t always necessarily a coordination problem.
Although, those certainly do exist. We also run into just response-burden
problems, and it is good to see OMB in the room, because we can only spend so
much time with the student in a school setting or so much time with a child in
their house or with the parent in their house without running up to really
serious burden issues.
And they are good issues. I mean, we need to consider them, because, from
our perspective, we want to get as much data as we can on education. So we
focus on assessments and we focus on educational resources in the household and
in the school, and, then, we know, from a lot of research, that there’s health
issues that relate strongly to educational development. So then say, Okay.
Well, we better make sure we get some of those health statistics in there, but
it is rarely the case that we can make it the primary focus of the study. So we
are limited to the types of data we can collect, even with really good
Does that answer your question?
MS. GENSER: I am Jenny Genser from the Food and Nutrition Service.
I wanted to ask if your surveys contain information on receipt of school
lunch and breakfast, and also if you have obesity data, because that is a big,
big health-related issue that wouldn’t show up in an educational plan.
MR. CHAPMAN: Right. It varies across our studies. I like to keep going back
to the longitudinal study that we have for the early-childhood populations,
because that is the one I work on day to day.
In that particular study, we actually work with USDA to make sure that we
had proper items in there to ask about weight, but apart from that, we also
have some direct assessments where we actually weigh the child and we do a
body-mass index measurement that is part of the data collection.
That said, I don’t want to say that that happens regularly and across the
board in all of our collections. I mean, it is actually relatively unique to
our early-childhood studies.
MS. GENSER: (Off mike)?
MR. CHAPMAN: We don’t ask the parents regularly whether or not children are
in free and reduced-price lunch programs for similar reasons that are causing
problems for the CPS item on lunch receipt.
In order to really get at that, you need a good 10 questions, and we don’t
always have time to nail that down, but, at the school level, we do, in our
school surveys, ask what – how many students in the school are actually getting
free or reduced-price lunch and whether or not the program exists in the
DR. STEINWACHS: I want to thank the panel speakers very much and hope
you’ll stay with us and continue this dialogue. We are at the point where we
promised you a break. We will deliver a break at this time, but what I always
say is five minutes and figure that it probably stretches a little longer than
that. So please take a break and come back in about 10 minutes or so.
DR. STEINWACHS: We are sorry that Dr. Citro is not going to be with us
today. She is ill and sent her regrets, very much, that she couldn’t be here.
So we have just about an hour-and-a-half and split it among three speakers,
and very happy to have Joan Turek, who deserves a very large part of the credit
for bringing together this group, and we did ascertain that Joan is the one who
knows everyone, and so certainly the right person to know, and so over the next
hour-and-a-half, we’ll be hearing from three key speakers on areas of
Maximizing the Benefits from Linked Data: Access for Research and Related
MS. TUREK: Thank you.
When they were setting it up, I asked to facilitate this section, because I
am a major data user, and I am probably very obnoxious about wanting access to
my data and not wanting anything to happen that could limit that.
It is unfortunate that Connie couldn’t be here. There is a disc available
either from Richard Suzman at NIA or from her, that has all of the reports of
the studies that they have done on data sharing.
The first one, in 1985, was called Sharing Research Data. The last one, in
2005, was Expanding Access to Research Data. So it looks like, over the last 20
years, we haven’t solved all the problems.
But we have three very good speakers. We are going to start with Brian
Harris-Kojetin from OMB, who is going to tell us about their activities to
improve access to data, and then we are going to talk to two major users about
what they really want to get.
I think it is important that we have the data available and we have high
quality data, but it is equally important that it is available to the people
who want to use it, and I think that the value of the data is really
dependent on our ability to get access. If you just collect it and stick it in
a box somewhere, you may as well save the money.
MR. HARRIS-KOJETIN: Good morning.
As the other speakers said, I have a similar kind of disclaimer. In fact,
if I say anything inappropriate, the Director of OMB may well disavow that I
even work there.
But I have been listening here for the past day and – well, yesterday and
this morning, and I am not sure what I have to contribute to your discussion. I
don’t have any data sets. I don’t link data. I don’t directly use any of the
data sets, but I’ll share with you some work our office does in terms of – kind
of related to this in terms of the confidentiality issues and some of the legal
Another disclaimer is, of course, I am not a lawyer. I do play one on TV
every once in a while, but – and I fake it in my job a fair amount, and, as
you’ll see, but, again, I have to disavow any of those kinds of things that I
might say that sound like I actually can make such a policy.
There’s a few laws – there’s a couple of issues on the charge questions
here that I thought I could say something about and see if it is at all helpful
to the committee.
There are several laws that have some impact on data sharing. Here are the
ones that I am familiar with.
One of the things that came out in the presentations by the folks from
Census yesterday, you heard Title 13 mentioned quite a number of times.
Certainly, a primary consideration is whatever statute – whatever authority –
legal authority the agency has that is originally collecting the information,
whether that is being gathered for statistical purposes or whether it is being
gathered for administrative purposes, the agency has to have some defining
authority to gather that information, and, oftentimes, their statutes will
specify what are appropriate uses for the information and if there are any
confidentiality provisions for that, and sometimes these are very vague and,
you know, Go out and gather data on health or on the economy and do good works
and disseminate it. Other times, it is very specific.
One administrative data set that many of you may be well aware of is the
National Directory of New Hires, and those of you familiar with it know that
there are – Well, with the exception, I guess, of SSA, every other agency that
has access to the data has access to it for a specific purpose that is very
carefully specified, and this is – So that is one example of what you are
allowed to access the data for, how it is allowed to be used, and it can be
specified by each agency that may be allowed access to it, and if you are not
explicitly allowed access to it, then, even though you have the grandest
intentions, you can’t have access to it.
Folks yesterday also mentioned a couple of broader laws that apply across
government agencies. The Privacy Act and some routine uses of information came
up. The Paperwork Reduction Act was also mentioned by a couple of folks.
Those of you not as familiar with it may be interested to know that under
the functions of the statistical policy and coordination functions that are
codified in the Paperwork Reduction Act, the director actually specifically
authorizes the chief statistician to promote sharing of information collected
for statistical purposes, but consistent with the privacy rights and
What I am mostly going to talk about this morning, the thing that I mostly
know about, so this is why I assume I got invited, was CIPSEA, and which many
of you are, I believe, aware of and maybe you know everything I am going to
CIPSEA is the Confidential Information Protection and Statistical
Efficiency Act of 2002, which is why we call it CIPSEA instead of all of that.
It is composed of two major titles. First one is confidential information
protection, which was – which you can see the purposes here is really to
strengthen public trust in pledges of confidentiality, prohibit disclosure in
identifiable form, control access to and uses made of statistical information
and ensure that information is used exclusively for statistical purposes.
CIPSEA provides a nice statutory floor for the use of information collected
exclusively for statistical purposes.
The second part of CIPSEA is the statistical efficiency part that applies
only to three designated statistical agencies – the Bureau of Labor Statistics,
the Census Bureau and the Bureau of Economic Analysis.
The goal of this subtitle was to reduce paperwork burden on businesses,
improve the comparability of economic statistics, specifically mentioning BLS
and Census, comparing their establishment – and increasing the understanding of
Quite a number of you, I am sure, are intimately familiar with the history
behind this, why CIPSEA was sought after for literally decades. This went
through many evolutions. There were many bills that came so close, some that
passed the House and then died, and we finally got CIPSEA in 2002.
As Nick noted earlier, if he is still here. I thought I saw him, but maybe
he stepped out, there was a companion bill – a Treasury companion bill for
CIPSEA that – for amending 6103(j) – that did not – I don’t know if it was ever
even introduced, but has not been passed that is really key to some of the
data-sharing provisions within CIPSEA, but that has not gone forward yet.
But one of the reasons why this has been a very important law is that there
is a real patchwork among many, many different agencies that do some kinds of
statistical activities. Now, there’s 10 agencies that are often referred to as
principal statistical agencies that that is really their sole mission is to do
statistics, but there are many others representative in this room and outside
this room that do some kinds of statistical activities as part of their other
mission that may be regulatory or providing services.
And there have been many attempts over the past number of years to
strengthen and try and standardize the statutory protections for the
confidentiality of individually-identifiable data.
As I was saying before, every agency has their own specific statutes and
has variations in confidentiality protection. Some, like Title 13, are
extraordinarily strong. Other agencies, prior to CIPSEA, like BLS, had
practically no authority whatsoever to – legal statutory authority – to base a
promise of confidentiality upon, and so this really – CIPSEA provides a
ground-level kind of foundation of protection for information gathered for
exclusively statistical purposes under a pledge of confidentiality. So I have
been saying this, now, uniform protection. It covers all data that an agency
collects for statistical purposes under a pledge of confidentiality. There’s
very strong penalties. This is similar to penalties that some agencies already
had, like Census under Title 13, and CES, under their statute, a $250,000 fine
and/or five years in prison.
It also specifically says that FOIA requests are exempt, because it defines
them as a non-statistical purpose.
When we talk about – there’s a few key distinctions that are important in
CIPSEA. One is between statistical and non-statistical agencies. CIPSEA
provides a definition of statistical agencies, those that are predominant –
whose activities are predominantly the collection, compilation, processing or
analysis of information for statistical purposes.
Statistical agencies have – are given some special privileges, and also
some requirements, extra requirements under CIPSEA. Specifically, one area that
I think many of you have been interested in is this ability to designate
agents, which may be a contractor or an external researcher or – this is
similar to Census’ authority to have special-status employees. CIPSEA
specifically provides this authority only for statistical agencies, not all
The other key distinction that we talked a fair amount about in the
discussion yesterday is this statistical versus non-statistical purposes.
CIPSEA really puts in statute this functional separation between statistical
purpose, defined here, and a non-statistical purpose and draws a bright line
between these two uses; that is, any information collected for statistical
purposes cannot be used for a non-statistical purpose, and a non-statistical
purpose being using the information in identifiable form that affects the
rights, privileges or benefits of a respondent, that we talked a little bit
about yesterday afternoon.
So for the Census Bureau to get administrative data that were used for a
program that were used to affect – were originally gathered for non-statistical
purposes, to take that across the firewall there and say, Now, we will use it
for exclusively statistical purposes. However, it does not go back. CIPSEA is
drawing that same bright line there. Even if your intention was to get – give
it to Census, get better race codes and say, then, Can we have it back in our
administrative records? Sorry.
So what requirements does CIPSEA impose on agencies? Inform the public,
basically, that CIPSEA has to – can only really take effect here is – You are
only collecting something under CIPSEA if you are adequately informing the
respondents that you are going to use the information for exclusively
statistical purposes and keep that information confidential, and we’ve got some
forthcoming guidance that talks about this specifically as a CIPSEA pledge,
and, of course, safeguard the information and protect it. CIPSEA is law for
protecting the information. Honor that pledge that you make to respondents.
In terms of data sharing, as I said, and as many of you are well aware, the
provisions are very specific. Only business data are covered. Only three
designated statistical agencies are authorized for this business data sharing.
So that is all that CIPSEA is itself authorizing. It is important to point out
that CIPSEA is not altering existing laws that may permit other data sharing
among federal agencies, but CIPSEA did not itself authorize any further data
sharing than this business data sharing and – between BLS, BEA and Census.
So some implications here, which I think is – which you are most interested
in that you may, again, be well aware of already.
For federal agencies that are acquiring and protecting confidential
statistical information, CIPSEA may offer some new protections for those
agencies that didn’t have strong legislative protection already.
It does not – It specifically does not restrict or diminish any existing
protections. So if the Census Bureau is gathering information under Title 13,
for example, and they can only use that information for a Title 13 purpose,
CIPSEA does not restrict or diminish that at all. CIPSEA does not say, Oh, but
why can’t you do something that – CIPSEA would let me do any kind of
statistical purpose. It does not affect that, is how the lawyers, I understand,
are interpreting it now.
For federal agencies providing access to confidential statistical
information, CIPSEA does permit statistical agencies – remember, only
statistical agencies – to designate agents, to perform exclusively statistical
activities. This is not a requirement for statistical agencies to do this. It
is a may. They may, if they so choose, do this.
This will require from them policies and procedures for access and control,
responsibilities for providing security and employee training and these things
take resources, as everyone in the room is well aware. RDCs are not cheap, any
other means of – Chris Chapman, if he is still here, can tell you licenses are
not free, and –
Implications for researchers, just to kind of wrap up here. One of the
things I wanted to make clear was that some people had viewed the language in
the law regarding agents as opening up researcher access to all data collected
by statistical agencies, which it really does not do. It does provide a means
for statistical agencies to designate agents, but imposes stringent
requirements on those agencies and the agents to protect the confidentiality of
It is important to remember, as I was saying, that CIPSEA doesn’t diminish
any of these existing protections, and so it is not going to remove some of the
barriers that currently exist, and it also doesn’t provide a right of access
to federal statistical data. Researchers who obtain authorization to access the
confidential data, for exclusively statistical purposes, have to share that
responsibility to maintain and uphold the confidentiality of any data they
And, as you all are well aware, different agencies have – vary in the
sensitivity of the information that they have and they may not be able to
provide access to their data or all of their data or may have to do so under
varying circumstances or in different ways, and so researchers seeking access
to those data will have to conform to the agency requirements and respect those
confidentiality provisions, even if those are more limiting and restrictive
than those that they may have for – in their own institutions or they may
encounter with some other data sets, but I think that was – I had to share with
you this morning.
MR. LOCALIO: Thank you for your presentation.
I just have to say that one of the problems that I mentioned yesterday, if
you were here, is that people refer to, The lawyers have done this, and the
lawyers have done that, but the lawyers are not here, and they do not
understand the problems that we are discussing. In fact, I am not sure that
they care about the problems that we are discussing.
MR. HARRIS-KOJETIN: I disagree with you, and some of the lawyers we work
with, but they are very, very well informed on this issue.
MR. LOCALIO: What I find, and I have reviewed some of the legislation, I
have a copy of CIPSEA here that I have been carrying around with me for the
last three years, and I have a copy of the statute that NCHS uses, and they are
in conflict. They are vague. Some of the things are vague, and it doesn’t seem
there has been an effort to reconcile them.
Is there any further effort, post-CIPSEA, to say how these statutes are
going to work in practice? Is there any effort to evaluate the implications of
these statutes, in terms of who gets what, when, where or do they just pass it
and then they say, Well, this is going to work?
What is the evaluation component of – Does OMB have an evaluation component
to figure out whether this is working? And I am not talking about the second
provision, the data sharing among the three agencies. I am talking about
essentially the first one.
MR. HARRIS-KOJETIN: Very pertinent question and I don’t know that the
evaluation is strictly in the law, but I think that is something we do care
deeply about or I do, since I am speaking for myself.
We have got guidance forthcoming on CIPSEA. Some of you may have heard me
say this for the past two or three years. Nick Greenia, in the room, I think
has just given up asking me when it is coming out.
It is actually coming out very soon, but you can’t really trust whatever I
say on that issue, since I have said it has been coming out soon now for two or
So we do have some fairly lengthy, in terms of relative to the statute,
like about 30 pages of guidance on implementing CIPSEA that we will be issuing
that will help agencies that are using that.
We have gone through quite an intensive interagency process to help develop
and inform this. Again, a few folks in the room were on an interagency team
that has helped give OMB input into doing that, and so as that gets out there,
as agencies are going forward, we have had a number of questions that have come
up from agencies in terms of what impact it has on their programs, how they
need to – how it will affect their operations. We have been dealing with those
on an ongoing basis, and, again, have used that to help inform the broader
So it is an evolving process and it is something that every agency is going
to struggle with a little bit in terms of what does this – what kind of changes
and what kinds of things does this do for us, and we have some reporting
requirements for agencies to get back to us on how they are using CIPSEA, how
they are using specifically the – the statistical agencies are using the
agent’s provisions, and so we can evaluate and monitor this and see where
things are working.
Many folks are, of course, very interested in the data-sharing portion of
that and hoping that we could go back to Congress after some time, after we
prove how good of a job that the three designated agencies have done sharing
business data, to see if we could explicitly expand that authority, which is
where I thought you might go, even though you pulled back from that.
DR. STEINWACHS: I think, just clarify that. There was a reference in there
to business data, and so that was the sharing among BLS and the others of
information on employers in business activities in the U.S.?
MR. HARRIS-KOJETIN: Exactly. Yes, economic data, and so the business lists
between Census and any of the economic surveys that Census and BLS did.
DR. STEINWACHS: Thank you.
MR. PETSKA: Can I make a comment also?
Again, speaking from my own personal view, I wish that CIPSEA would have
gone further and kept in the tax component of that, because when CIPSEA, when
the early discussions of CIPSEA were unfolding with Brian’s boss, Kathy
Wallman, at OMB and other representatives of the federal statistical agencies,
there was a lot of valid research purposes that were articulated for sharing of
data, including tax data and so on, and once CIPSEA started to be formulated,
it became very clear that the – one of the more controversial aspects was
tax-data sharing, and the question is – I don’t know if it is because of
congressional committees. Clearly, there was not strong support on the Hill.
Randy Kroszner, in the Council of Economic Advisers office, was pushing
this and so on, but when he left the administration, it seemed like that piece
had no possibilities at all.
I spoke to him a couple of months ago at a conference in Cambridge and his
comment was, CIPSEA, as it was, was the best deal we could get, that we would
have liked to have expanded data sharing to more agencies, including the tax
component, but it was clear that that bill would be dead on arrival, which is
MR. GREENIA: I am Nick Greenia from IRS and I just wanted to add a couple
First of all, I wanted to address the previous question in terms of the
evaluation, because the guidance that is, we hope, going to be coming out next
month – Brian, is that right? – in the Federal Register –
MR. HARRIS-KOJETIN: Absolutely.
MR. GREENIA: The guidance is actually – I was on the other agency committee
for that. So I can speak to that a little bit.
I think the outlook is unknown. I think the answer to your question is it
is not clear, and one of the reasons I say that is because there’s a lot of
flexibility in terms of how agencies can safeguard the data, how they make the
data accessible to researchers, and I think what may come about is, if you
will, a de-facto evaluation, which is that researchers and Congress, if there
is another data-sharing effort, are going to look at the experience and they
are going to see that, You know what, it depends on the sensitivity of the
data, in terms of what sort of safeguards are prescribed, and procedures, and –
are on, and there is going to be a lot of flexibility, and there is going to be
a lot of variability in terms of how the data are accessible and protected.
And I just wanted to add something to what Tom Petska said, since Connie
Citro is not here, as you may know, since that report on the data-sharing
workshops was released last Friday, and if you would like to get a – I am doing
a little plug here for CNSTAT, of course – but if you want to get an idea
of some of the difficulties, including the recommendations facing tax data for
purposes of data sharing, I would highly recommend you read the article
coauthored by Mark Mazur and myself on tax data and some of the many, many
issues that have to go into that.
And picking up on what Tom said regarding why the tax-amendment bill did
not go anywhere, as you know, the tax-amendment bill accompanied CIPSEA in July
of 2002 to Congress, and CIPSEA proceeded to the floor of Congress, and the
J-bill, the amendment to the tax bill, went to Joint Tax Committee, and there
were a number of reasons, we think, that the tax bill foundered. Tom has put
his finger on one of them, which is the leadership vacuum, but there are some
other lessons that we think are valuable as well, including freezing the items
in the statute itself, as opposed to allowing infinite regulations to stipulate
item content in the future.
So I highly recommend you take a look at that article for tax data.
MS. TUREK: Thank you.
I have one question also.
We have been talking here about new forms of data, that would be the survey
and administrative data linked together.
Has OMB begun to look at the implications for users and to look at whether
or not we would need any kind of new statutory language to permit this kind of
data to be shared? Because, I mean, it would not be identifiable, presumably,
but there was the risk-disclosure issues.
And will OMB take a position or will it look at what should be done to help
users get access to this data once it is available?
If you do a SIPP that’s got a match to administrative records, will you
look at whether or not we could have a public-use tape?
MR. HARRIS-KOJETIN: I thought David was going to look at whether you can
have a public-use tape or not.
You’ll have a public-use tape.
MS. TUREK: Thank you.
I mean, I just wondered if OMB had a role in this or –
MR. HARRIS-KOJETIN: We could have a role if we need to have a role, if it
seems that that would be helpful.
Obviously, when you are bringing in the linked data, there are other things
that go along with it. The example I was giving before, an agency certainly can
promise confidentiality to a statistical – Well, a statistical agency can
promise confidentiality to another agency, and so BLS, for example, does this
all the time when they gather information from states. They take data that
states may not consider confidential, but say I will use this for exclusively
statistical purposes and I will keep it confidential, and once it goes back
there, then BLS intermingles it with their other information, and, in essence,
they have elevated the level of protection required. Just like whenever
anything gets intermingled with IRS tax data, it gets elevated to that status
MR. IAMS: Could I make a comment?
I am Howard Iams from Social Security.
I really think Tom is correct and Nick is correct. You have to have
legislative authority to permit a broader sharing than currently exists. The
agencies do not have the authority to pass data to other agencies for use at
those other agencies for statistical purposes, and this is separate from
disclosure and confidentiality. The agencies just cannot pass it and cannot use
it without some sort of legislative authorization, and Brian can – I don’t know
– perhaps disagree on some instances, but I don’t think that the agencies that
are interested in this can go further than what they are doing now, and a lot
of the limitations that you are hearing about are created by this legislation.
Now, the disclosure raises a – Well, let me finish – My train of thought
would be that if you are a CIPSEA-authorized or a CIPSEA-compatible agency,
which I think I am in – We are a statistical outfit in a big administrative
organization, but we just do statistics and policy research.
There ought to be, ideally, permission to share confidential agency data
with such a group for them to use it however they wish for whatever purpose
they want, not just Title 13, not just Title XYZ, but that it is a
statistical-analysis function that might have policy implications, might not,
but it is for a statistical purpose. It is not going to go and administer
somebody’s benefits or affect some individual’s rights and whatever, as CIPSEA
is defined. That would open up a whole lot of sharing that outfits could do and
a whole lot further analysis with this type of information than is currently
Once you bring in the disclosure, confidentiality issues, you raise a whole
lot of other things, and my only comment would be, the thing that I think
undermines almost any public-user file is geography.
We put a copy of our New Beneficiary Survey – It is sitting on the web.
It’s got two surveys. It’s got earnings from our tax records. It has hospital
records from the Medars(?) file. It has our benefit information.
We cleared this through – with Pete Sailer’s help at IRS through their
requirements. It meets all their confidentiality requirements. It meets
Medicare’s, CMS’s – at that point it was HCFA. It meets SSA’s. The key is
there is no geography. It is a big country out there. There are a whole lot of
characteristics that you could say are unique, but you really can’t tell,
because it is a big country out there.
If you know what state someone is in, it is over. It is not a big country.
It is a small state.
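The point about geography can be illustrated with a small sketch. It counts the share of records that are unique on a set of characteristics, first without and then with a state field. The function name, the field names and the four sample records are made up for illustration; they do not come from any agency file or from any agency's disclosure-review procedure.

```python
from collections import Counter


def unique_share(records, keys):
    """Return the fraction of records whose combination of `keys` is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Hypothetical records: two pairs that look identical until state is added.
sample = [
    {"age": 70, "occupation": "teacher", "state": "MD"},
    {"age": 70, "occupation": "teacher", "state": "WY"},
    {"age": 65, "occupation": "nurse", "state": "MD"},
    {"age": 65, "occupation": "nurse", "state": "MD"},
]

share_national = unique_share(sample, ["age", "occupation"])
share_with_state = unique_share(sample, ["age", "occupation", "state"])
```

In this toy file no record is unique on age and occupation alone, but half of them become unique once state is included, which is the sense in which a national file is "a big country" and a state file is not.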
Now, we have a national program. For Social Security, it doesn’t matter
what state or what locality you are in, but if you are dealing with TANF,
you want to know the state.
I, being selfish, think that they ought to put out all this administrative
data on a national file with no geography, and if you want to use geography,
you should have to go to these research data centers. That is what the
University of Michigan does with their Health and Retirement Study. There are
restrictions. Users can use it in University X in Alaska, Hawaii, whatever. The
only place you can do geography is in Ann Arbor, Michigan. That is the price
you pay to have geography with those data. If you’ve gotta have geography, you
will never have a public file with confidential data. It will not be possible
in today’s age. My judgment.
MS. TUREK: I find that fascinating.
I think we ought to go to our last two speakers who are both users, and
actually will be from a very different perspective, I think.
Heather Boushey is an economist with the Center for Economic and Policy
Research and Dr. Deb Schrag is – I guess you are on the staff of the Memorial
Sloan-Kettering Cancer Center, and so we have here an economist and a medical
doctor who are both heavy data users. So I think it’ll be really interesting.
MS. BOUSHEY: Great. Thank you. Thank you so much, Joan. Thank you for
inviting me to speak here today. It is a pleasure to have the opportunity to
talk to you about the way that we use data.
Before I talk about the main points that I want to make, I want to tell you
just a little bit about myself and my organization and what we do, because my
understanding is that that is why I have been invited here to speak today, to
talk about how we use the data that agencies make available.
I am an economist. I work at a think tank here in town called the Center
for Economic and Policy Research. We are a very small shop. We have four
economists, about a staff of 15, and we do research on economic issues facing
people here in the United States.
We are very heavy users of the CPS and the – the Current Population Survey
– and the SIPP, the Survey of Income and Program Participation – although, I don’t
know, in this audience, if I need to spell out those acronyms, but I am so used
to doing it.
We make use of this data in a very timely manner, both to affect media
debates and policy debates around pressing policy issues.
We work on very short time frames, and, because of that, we have spent the
past five years taking both the SIPP data and the CPS data, both – we are
working on the March, but we have done this for the ORG – and creating what
we call Uniform Data Files.
As many of you know, if you use survey data, they do these fabulous things
at Census and BLS where they’ll change the name from year to year or they’ll do
little things that mean that you can’t just write one piece of code and pull
out the data from every year.
So you have to invest a lot of time if you want to know what has gone on
between 1973 and today or ’79 and today. If you want to have a time
series, you have to sort of make this huge up-front investment.
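A minimal sketch of the kind of up-front harmonization described here, assuming each survey year ships with its own variable names. The variable names, the mapping table and the function names below are hypothetical and are not taken from the actual CPS or SIPP files or from the organization's published extracts.

```python
# Hypothetical per-year lookup: raw variable name -> uniform name.
RENAME_BY_YEAR = {
    1979: {"EARN_A": "weekly_earnings", "AGE_A": "age"},
    2005: {"EARNWK2": "weekly_earnings", "AGE2": "age"},
}


def harmonize(record, year):
    """Rename one raw record's fields to the uniform names for its year."""
    mapping = RENAME_BY_YEAR[year]
    row = {"year": year}
    for raw_name, value in record.items():
        if raw_name in mapping:  # drop fields with no uniform equivalent
            row[mapping[raw_name]] = value
    return row


def build_uniform_file(raw_by_year):
    """Stack every year's records into one list with consistent field names."""
    rows = []
    for year in sorted(raw_by_year):
        for record in raw_by_year[year]:
            rows.append(harmonize(record, year))
    return rows


raw = {
    2005: [{"EARNWK2": 650, "AGE2": 41, "UNUSED": 1}],
    1979: [{"EARN_A": 240, "AGE_A": 35}],
}
uniform = build_uniform_file(raw)
```

Once the lookup table exists, a single piece of analysis code can run against `uniform` for every year, which is the investment the speaker describes making once rather than for each new question.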
So we do that, and we have made all of this publicly available on our
website, all of our code and our uniform extracts, but we have made this
investment so that when there is a debate on the Hill or when the media says
something that is inaccurate about what is going on in the economy, we have a
data set that is up and running, that is on our desktops, and we can then
comment on it in days or weeks, rather than months or years, and I know all of
you know just how complex this work is, and being able to do the work up front
and have it available for timely analysis is a critical part of our mission and
how we use the data.
Which is not to say that we don’t have longer-term research projects. We do
projects that take years, but it builds on this uniform data file and it is
always with this goal of policy work.
We do not do any linking with administrative data, because it is certainly
beyond the kinds of timely work that we could do.
I do have experience working with administrative data in another life, when
I was at the New York City Housing Authority. So I do have some understanding
of just how complex some of those issues are, but I have not matched it, just
so that – I don’t want any questions about that. I have never done it. Don’t
want to. Sounds complicated.
So being able to have data on our desktops that we can use quickly and
accurately and that we have confidence in because we have already done all the
background work has been critical to our success, and it is that major point
that I want to relate to the main points I want to make to you here today.
We are concerned about timeliness of the data, and we are concerned about
our access to it. The question that Joan said about public-use files is
critical to what we need to know.
And we are also concerned about – and I don’t know how germane this is to
this topic, but we are concerned about maintaining access to survey data. I
think – and I’ll talk about that just for a few seconds at the end.
So my first two concerns about timeliness and accuracy of the privacy
issues, of course, they are linked, and so the major questions that we have are
will administrative data that has – survey data that has administrative data
matches lead to delays in releasing the data? Will it be less timely than it is
And, second, will it require new security measures, kind of like the
things that have already been talked about requiring us to go to special
locations or special sites to use the data, and that will significantly both
delay our ability to use it, but not just delay it by days or months, but could
delay it by years, because we wouldn’t be able to have it sort of up and
running and ready to go when we have an issue.
Now, I have not been doing this kind of work for maybe as long as many of
you have, but I have heard tales from people who got their Ph.D.s in the
’80s and before that it used to be the case that if you used survey data,
you had to go to special computers, because you couldn’t do it on your laptop,
and I hear they had these things called cards and that it was very time
consuming, and what I find – the point here is that I think that the way that
we are able to use data now has transformed the way that we are able to engage
in policy debates, both at the national level and the state level.
The fact that we have access to this data, not just on our desktops, but I
have access to it on my laptop at home, I can do it on the train, that we are
able to do these very complex kinds of work in a much faster way than we had
been able to in the past.
I think you can – There is some correlation between that and the rise of
think tanks like mine and private research organizations that are affecting
policy debates, both at the state and local and national level, and I think
that that is an important accomplishment that technology has given us, and we
don’t want to sort of move backward in any way. So I think that that is just a
very critical, critical point.
To give you a couple of examples, when I have been asked to testify in
front of Congress, both in the House and the Senate side, I had no more than
two weeks’ notice, and, in one case, I had just five days’ notice.
These are very short time lines, where, if you want a specific number from
your data, you need to just be able to go to your computer. You don’t have time
to go and to sign up and to wait.
But the second kind of concern we have about timeliness is that we often
only have a few months lead time to know what the issues actually are. We spend
a lot of time thinking about what kinds of policy issues are going to come up,
what do we need to prepare for over the next year or two, but we may not know
whether or not Congress is going to vote on minimum wage this session or next
session until a few months out.
So having to go through application processes is simply just not viable for
those of us engaged in policy debates. It is perfect, it is fabulous for
academics, and we build on their work and use it, but we need things that are,
of course, shorter.
And on this issue, I might add that this is something that our organization
is on sort of the more progressive end of the political spectrum, that we have
been working very closely with the Heritage Foundation on these issues about
access and timeliness, because they are just as concerned as we are, and it is
certainly a point that transcends, I think, political boundaries and sort of
goes beyond left and right, which I think is a very important point, and
especially this issue about having independent organizations be able to access
government data to discuss pressing policy issues is one that we all, both on
left and right, agree on.
I basically am making the same point over and over again. So I won’t be
that much longer here. We need access to timely data. So, hopefully, that has
gotten through here.
So the final question in the set of questions that we were given ahead of
time to focus on was what are the potential costs to the public from failure to
take advantage of these opportunities, and I have a couple of comments on that.
First of all, one of the largest projects I am engaged in right now, we are
looking at take-up or effective coverage of benefit programs in 10 states. We
are doing this for advocacy purposes, for public policy, and we are doing it
using the SIPP and the CPS and the National Survey of American Families.
Now, we know that each of these data sets has significant problems with how
people report their benefits, that there is under-reporting of benefits. It
would be absolutely fabulous to be able to have data that is matched, so that
you can look at eligibility for public programs and then get the numerator to
be an actual estimate of coverage. That would be a significant improvement,
and, right now, we are working on this project, because, in the states, many of
the state groups that we work with are very concerned about take-up of public
programs, and because there is no real place that makes a lot of this
accessible across a wide range of programs, because the eligibility rules, on
the one hand, are so complicated, if you want to look at eligibility, you have
to use survey data, and, quite frankly, the SIPP is the only survey that I have
used that has enough questions to really get at the complexity of eligibility,
which, of course, as Mr. Iams said, this is all at the state level, this game
is all at the state level, but we really don’t have a good numerator, because
we don’t have administrative matching. Being able to have that matched data
does have – I mean, significant policy implications that we could be using
right now. So that would be wonderful.
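The numerator problem described above can be sketched in a few lines. This is only an illustration with made-up records: the field names (`survey_eligible`, `survey_reports_benefit`, `admin_enrolled`) and the `take_up_rate` helper are hypothetical and do not come from the SIPP, the CPS, or any agency file.

```python
from dataclasses import dataclass


@dataclass
class Person:
    person_id: str
    survey_eligible: bool          # eligibility simulated from survey answers
    survey_reports_benefit: bool   # self-reported receipt (may be under-reported)
    admin_enrolled: bool           # receipt according to a matched administrative record


def take_up_rate(people, use_admin):
    """Share of survey-eligible people who receive the benefit.

    The denominator always comes from the survey; the numerator comes
    either from self-reports or from the matched administrative flag.
    """
    eligible = [p for p in people if p.survey_eligible]
    if not eligible:
        return None
    if use_admin:
        receiving = sum(p.admin_enrolled for p in eligible)
    else:
        receiving = sum(p.survey_reports_benefit for p in eligible)
    return receiving / len(eligible)


# Invented example records, for illustration only.
people = [
    Person("a", True, True, True),
    Person("b", True, False, True),    # receives the benefit but does not report it
    Person("c", True, False, False),
    Person("d", False, False, False),  # not eligible, so excluded from the denominator
]

survey_rate = take_up_rate(people, use_admin=False)
matched_rate = take_up_rate(people, use_admin=True)
print(survey_rate, matched_rate)
```

In this toy data the self-reported rate is lower than the matched rate only because one eligible person under-reports; real gaps would depend on the survey and the program.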
Of course, all these issues about privacy, I leave it to you all to sort
that out, but we would love to have access to it.
But I do have a couple of concerns about sort of the move to the matching –
particularly, and, of course, my perspective is thinking about this matching
either SIPP or CPS or ACS, one of the surveys that I have used, with
My first concern is, with my limited experience using administrative data,
I think that there are concerns about accuracy and how high the – what the gold
It seems to me that we need three kinds of data. We need administrative
data that tells us one thing, but there are biases and there are – obviously,
there’s problems and there’s errors in that data as well as there is with
survey data. The biases run in different directions.
We need survey data to tell us about the wide populations, the full scope,
and I think we also need qualitative data to tell us some of the why questions,
but that is a whole different group of people.
This question about whether or not the administrative data is always
perfect and what we are going to gain from matching and how we talk about that,
especially with the public and especially with the people that are trying to
convince how important this is. I think it is an important note to just note
that some of the caveats and some of the potential problems with that, in terms
Second, administrative data is clearly no substitute for survey data, and
this cannot come at the expense of these surveys.
Again, looking in the issue that I look at, eligibility for public
benefits, this is something – and we need to know how many folks are eligible
for Medicaid and who aren’t receiving it or who are receiving it. The only way
we can do this is through surveys that ask a ton of questions of people at a
subannual, a monthly level, because this is the way that people access these
I mean, and just to go off on that just for a second, one of the things we
learned from our work with the SIPP – and looking at take-up – is that people
are – People move up – their incomes move up and down month to month, and when
they access the system is not necessarily the month that they don’t have any
income, because it takes months – weeks or months for them to even make it to
the office or get on line or get all their papers together to receive Food
Stamps or another benefit program. You need access to the survey data that
provides you with those dynamics. They are not substitutes.
And then my final point, which is going back to my theme here, if the cost
of matching is that we lose in terms of timeliness or the ability for the
public to access public-use files, then I think that is a serious concern and
one that we should spend a lot of time focusing on.
And I think, having said that message about 12 times here, hopefully, it
has come through, and I think I will stop there. So thank you very much for
allowing me to speak to you.
DR. SCHRAG: So I am Deb Schrag. I am a physician and health-services
researcher at a big cancer center in New York City, Memorial Sloan-Kettering
Cancer Center, and I also appreciate the opportunity to speak to this group.
Unlike Heather, I don’t do anything in days to weeks. I am more
representing the perspective of academics, and we do everything on a months to
years time frame.
I am going to – I guess, I have to say that I had a completely different
talk when I came here yesterday, and perhaps it is being towards the end of the
workshop, I revised my slides and got rid of most of the data examples and
slides that showed the results of various linkage projects I have been involved
in, and put in what I’ll call more philosophical slides, because I think some
sort of more conceptual framework – Maybe it is just – at the end of a
workshop, it feels like a conceptual framework tying together all these
different enormously complex issues that we have heard about for the past two
days is sort of in order and maybe there’ll be some discussion in that regard
at the end.
Again, I represent an end user, not my institution or any agency.
So types of research questions, examples of linkage attempts, challenges
that we have encountered and, of course, a wish list to add to Heather’s.
So I guess I am here representing academic health services researchers, and
we examine, obviously, relationships between need, demand, supply, delivery and
outcomes of healthcare.
The big topics for us, I would say, over the – since – in this decade – and
I think that these are going to remain front and central on people’s research
agendas – are disparities in healthcare, access and barriers, technology
dissemination. Quality measurement is a big one, and, ultimately, efficiency of
healthcare delivery. So that includes all the cost issues.
We talk about data – and I think that this is an underlying theme of many
of the presentations that we have heard in this workshop – is that these data
We start out with source populations, basically, United States citizens,
who have IRS data. They work. They don’t work. They do save for retirement.
They don’t. They exist in specific geographic regions of the country, and on
top of the source populations are diseased populations. I happen to work in
cancer. Other people work in mental illness or psychiatric disease or
malnutrition, all kinds of examples of what one considers a diseased population
or a population with a health concern of interest.
On top of that are providers. Typically, these are physicians, but other –
nurses, other types of healthcare providers as well, and, on top of that, are
healthcare delivery units, facilities, whether they are clinics, hospitals.
The issue with federal data is that federal data is better at the bottom of
the pyramid. Federal data has a lot of information about source populations and
some – For example, Dr. Breen is here from the National Cancer Institute. A lot
of information about populations who get cancer. It tends to be a lot less rich
as you go up to the top of that pyramid and have a lot less information about
providers and facilities.
So when we try to link, very often, in my experience, what health services
researchers are trying to link is the rich, rich government data at the bottom
of the pyramid with more granular detailed data about providers and facilities
at the top of the pyramid that often resides outside the public domain, if you
will, and I think we have heard allusions – you have heard references to some
of these data sources from the speakers yesterday. AHA data was mentioned, AMA
data, and I’ll give you some examples.
We talked about evaluating the quality of healthcare. Really, we are
interested in health outcomes, and the main ones – main health outcome we get
out of big federal databases are typically just very basic things like who
lives, who dies and who gets particular diseases. So mortality and incidence,
And the inputs that we want to go – that we want to relate to health
outcomes are community attributes; person attributes; health risks and
behaviors, which come from the big surveys; the structure of delivery systems
and the processes of care, processes of care. I would probably put Medicare
data in that bucket. Medicare data is what exactly are we doing to these people
– all of us – that lead to these health outcomes.
As we think about linking data, I think it is helpful to think about what
the frameworks are for putting data in different buckets, and, now, obviously,
some data belong in multiple buckets. Medicare data also has mortality, which
is an outcome, but I think that sort of as we conceptualize these linkage
exercises, it is helpful to think about in what domain the data sets belong.
The other thing that I think is helpful to think about, and I always think
about when I am contemplating any sort of linkage project is where it lies
along the spectrum of pure population-based data.
So federal-agency data are best for big, broad population-based analyses.
So a cancer example is I want to know something about everyone in New York
State with lung cancer, and I can go to the registry and Census data, but, very
often, increasingly, we are interested in quasi-population-based data, where we
want everyone in New York State with lung cancer who is covered by a particular
private insurance provider – Oxford Insurance Plan – and so I think it is
really also important when we talk about linkages to be clear. Is this true,
pure population-based data? Are we trying to link federal agency or state
agency data with some external data source that resides elsewhere? We need some
kind of nomenclature for where those boundaries are.
And, then, of course, there is non-population-based data. It may be that –
You know, my research institution is always coming to me and saying, Well, you
link these data and you work with these large population data sets. We want to
know why cancer patients in New York State are not all coming to get their
treatment from us, since we are the best center, and you have all these data.
Why can’t you do that?
And I say, guys, that is marketing. That is not health-services research.
That is not an appropriate way to use these data, but explaining sort of
analyses at the population, the quasi-population and the non-population data,
we need some kind of taxonomy, and I think simply having that taxonomy or the
government help develop standard taxonomy for these types of activities would
be a really helpful place to start and would educate the end-user community.
Obviously, the health services research strategy – I mean, we just all want
to get our hands on as much data as we possibly can as quickly as possible, and
we want to juxtapose all these various data sources.
So the kinds of things that we work on, really, I would say the focus and
theme of our research is to look at what we call the implementation gap, which
is the difference between clinical efficacy and effectiveness.
So most healthcare – What works in healthcare is discovered and described
in these very neatly, nice-packaged little clinical trials where we take 100
people and give them blue pills and 100 people and give them red pills, and we
decide that the red pills are better.
What we get out of that is efficacy. These red pills work, but that really
doesn’t tell us anything about what happens when we unleash red pills on the
When we unleash red pills on the population, we are trying to measure
effectiveness, that gap between efficacy and effectiveness – me and others call
the Implementation Gap, and we are really trying to get at what the reasons are
for those gaps and trying to identify important sources of variation and
particularly those that we can do something about, and we want to know whether
the reasons for the gap are endogenous to patients, doctors, healthcare
systems, background population. So that is really the unifying theme of the
research and why access to these data are so incredibly important.
We told you a little bit about this data source yesterday. Very simple
example, a kind of chemotherapy given after an operation for a particular stage
of colon cancer, and this is a big deal. Fifty-thousand Americans get this
condition a year.
Do patients, in the Medicare population, who are insured and this kind of
chemotherapy is covered. Do they actually receive these treatments?
Well, we went to SEER-Medicare, and, very quickly, within a week, were able
to answer the question.
Now, of course, it took a while to get the data. Once we had the data, the
analyses took a week. So, again, I spend 90 percent of my time trying to get
data, manage permissions and so on, and actually analyzing it is a lot less
But we identified a very simple finding, which is that there is a very
steep gradient, and that, although we treat most young Medicare beneficiaries
with this kind of chemotherapy, we really don’t treat the older folks.
Well, this very simple finding, made possible by linked data, really
sparked a whole set of subsequent more detailed analyses to go back to
physicians and patients and to conduct interview studies, to really hone in on
what the reasons are that underlie this important healthcare-delivery pattern.
Now, one of the problems here is that the older patients were never
included in the randomized trials, those little efficacy studies. So doctors
don’t know what to do. So there is uncertainty, and this is what actually
Okay. So there is a real circularity where we do these population-based
data analyses with linked data, and that, basically, catalyzes subsequent
studies to really get at the underlying reasons.
So we have this nice linked-data set, but we really wanted to know – We
said, Look, these people are not getting a kind of chemotherapy that they
really ought to. What is going on? Are they just not being referred to medical
oncologists who have the therapy? Are they refusing the therapy, even after
they go to medical oncologists and are old people just saying, Thanks, but no
Well, to do that – Sorry. This is Census data that just shows that we also
– Census data can be helpful because we see that if you are married, you are
much more likely to get the right kind of treatment, chemotherapy, than if you
are widowed or single, and these are all adjusted for age, and these are very,
very strong findings. So having Census data can be very, very important, and we
can figure out who is at risk for getting inappropriate medical care.
So why doesn’t everybody get chemotherapy? Do people refuse? Do they see a
medical oncologist? We could use the UPINs on CMS claims, and CMS claims have
access to some specialty-code information about the types of doctors patients
receive, but that data is not particularly complete, updated or accurate. Maybe
Gerry can comment on it. There are other better data sources for figuring out
So when we use just the specialty data, we could see that among the
patients who got chemotherapy is represented by the green bar. Most people saw
an oncologist. The top 20 percent there in the chemo bar, those people got
chemotherapy, but, apparently, did not see an oncologist. All those people saw
internists. Well, that is just because CMS doesn’t know the difference between
an oncologist and an internist, because the data is not coded well, but we
wanted to get our hands on better data sources.
The people who didn’t get chemotherapy, most of them made those decisions
without seeing an oncologist.
So we are able to basically do these kinds of analysis to say, We, in the
healthcare system have a healthcare delivery problem, patients are not making
informed decisions not to get a treatment. They are making uninformed
decisions, because they are not even going to see the relevant providers.
People look at this data and it wasn’t all that compelling because they
say, You can’t even figure out who is the medical oncologist. So, then, we
wanted to get AMA data.
It took essentially 18 months to get the AMA data, which is much more
complete, to be able to do the linkage to really prove that to a higher level
of satisfaction. Very complicated to do.
Ultimately, successful, and, then, the green bar went all the way up 99
percent, and the green bar in the no-chemo group went up to about 40 percent,
and our conclusion was, essentially, that the mechanism was people were not
appropriately being referred.
So a wish list from an end user would be linkage of UPINs on claims data to
files that describe physician characteristics.
AMA data is better than CMS data. Data from the American Board of Internal
Medicine, the American College of Surgeons, all the specialty societies, is
still better than AMA data, and the untapped resource is state-level data. That
is the most complete and the most difficult to obtain.
So I have a license to practice medicine in the State of New York. They
know lots about me. They know whether I have ever committed a felony, been in
jail, all the tests I have taken. They maintain it. I pay them $250 every two
years to update that license.
So we don’t have that data to link to federal. So with respect to health,
it is really critical to know physicians, physicians’ characteristics,
distribution of physicians.
Next on the wish list is pharmacy claims. The analysis I showed you, we
want to know were people not getting intravenous chemotherapy because they are
getting oral chemotherapy? Are people not getting supportive medications? Are
they sticking to their therapies? Are they getting appropriate pain control?
So the wish list would be Part D data. We heard about that yesterday.
Medicaid data for pharmacy claims, and private claims data sets. There are
enormous pharmacy-data clearinghouses that are very – that have not been widely
linked, but are very important for health-services research.
So, again, the example I just gave you involved taking federal data set and
trying to link it to external data sets that are not federal. So I think
developing some kind of taxonomy or framework for what these linkages are – Are
you linking federal to federal? Are you linking federal data to state data? Are
you linking federal data to private data with a broad public relevance?
And I put AMA data, AHA data. Those are private data maintained by
non-profit – Well, yes, AMA is a for-profit organization, but they are big
organizations that have – really control, monopolies on important data sets
that pertain to health that have broad relevance for many researchers in and
outside of government.
And, then, there are custom data, where you have your own personal data set
about the patients in a particular region or with a very specific set of
disease that you have that you then want to link to.
And I think developing some kind of taxonomy to understand what the
activities are would help frame development of coherent policies and rules that
researchers could understand.
So an example of a study I am working on – and Gerry Riley has been
extremely helpful to us here – are to look at capacity to deliver mammography
in the United States.
Women in the United States, age 40 to 80, need mammograms. A lot of them
are unscreened. There are big racial disparities, and the lack of available
facilities, mammography screening centers, and lack of radiologists are
potential explanations for suboptimal use.
So the question here is does lack of capacity explain geographic variation
and racial disparities? And does capacity predict breast-cancer incidence and
Well, these kinds of analyses require geo coding. They require knowledge of
where the facility is, and you can get that from FDA accreditation data. Where
are the radiologists? Again, that is physician data. Where are women
unscreened? BRFSS, Medicare data are informative there. And where are there
high rates of breast cancer? SEER.
To do these kinds of analyses, we want data ideally at the Census tract
level, but if we can’t get it, we’ll go less granular to the Zip Code or county
Trying to do a project like that and figure out where to start to go to
obtain the permissions is extremely complex. Approval from one agency or many.
So some kind of central clearinghouse and more clearly delineated set of
procedures would help.
To get these kinds of projects done, we are really dependent on personal
relationships with key individuals who sit in specific agencies.
For example, Gerry, in this case, helped us get access to the FDA
accreditation data for mammography facilities, but people who don’t know Gerry
can’t do this, and that doesn’t really seem fair. Although, we are very happy
we know him.
Finally, I want to talk a little bit about area versus person-level data.
Access to granular-level data helps most health-services researchers, and
privacy security concerns obviously involve less risk when you are talking
about area as opposed to person-level data.
I think, again, in all these discussions, it is really clear it really
would be helpful if we delineated between what we are talking about, and I
think Howard was alluding to this before.
So, again, area-level data at the bottom, state, county, Zip Code, Census
tract. On top of that, you often have anonymized patient data, which can be
linked to unit area, and, on top of that, what is the most vulnerable is
individual patient data, and I think one thing to think about, in terms of
linking federal-agency data sets, is to release them – again, I am not
pro-public-use data – but to make them available to the research community with
appropriate bars and jumps and hoops you’ve got to jump through is to make them
– make area-level data available without making individual patient-level data
and have discussion among the agencies of what the hoops are to get state,
county, Zip-Code, Census-tract level data with higher bars the more granular
So, again, this would really help us with very repetitive common tasks that
we end up performing again and again when we try and make maps and figure out
where are the patients, where are the providers, where are the disparities and
where are the mortality rates.
Wish list in that regard would be access to choropleth maps by various
geographic units, very useful for common-data elements and Census and
survey-data results. It could be a shared resource for investigators, and ArcGIS
and other software packages that have really just become available in the
last three or four years have catapulted the possibilities in the ease – how
easy it is to do this light years ahead. I would say even in the last three
years. Others might wish to disagree.
I think I am going to skip this and talk, finally, about Medicaid, and if I
had to put something at the top of my wish list it would be for the federal
agencies to help us figure out how to tap in more effectively to Medicaid.
I think the states just don’t have the organizational capacity to get this
going, but the federal government does, but the states really care. The largest
component of their budget, talking about healthcare, I guess, at this workshop,
Medicaid funds healthcare for the poorest, sickest members of society. It is
really an untapped resource. I think CMS has done an enormous amount of work
over the past decade trying to get the data into some common file structures
and make it a little bit easier to work with, but it is still an untapped
I think when we talk about Medicaid data, we want to distinguish between
two things. One is enrollment. Who are poor people enrolled in Medicaid versus
the administrative data which is what is done to those people, are they getting
EKGs or chest X-rays or what units of healthcare are they actually consuming,
and it may not be so complete for the latter, but very informative for the
former, and I think we really have to not let the perfect be the enemy of the
good when we talk about Medicaid data.
So this is just an example of a study where we tried to link
cancer-registry data from the entire State of California to Medi-Cal, which is
Medicaid claims for the State of California, because we wanted to know about
delivery for cancer care to poor patients, and we were wondering what would the
yield be from a linkage of California cancer registry data and Medi-Cal.
So what we did is we started, and you’ll note on the left, with all
incident cancer cases reported to the California Cancer Registry, which is very
complete, and we took ’98 cases, and if you look at the cervical-cancer example,
there were 1,690 women diagnosed with cervical cancer in the State of
California in that year. About 80 percent of them were 18 to 64 at diagnosis.
We don’t care about the older ones, because we have them in Medicare, right? So
that is about 1,350 cases.
What proportion of them were enrolled in Medicaid? About 21 percent. If you
look down at hepatoma, it is about 35 percent. So these are cancers that are
associated with infectious diseases. These are important health problems,
Hepatitis B and C and the HPV virus can be prevented. That links, therefore,
back to health surveys.
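The arithmetic behind the cervical-cancer figures can be checked with a short sketch. The inputs are the rounded numbers quoted above ("about 80 percent", "about 21 percent"), so the results are approximations; the variable names are illustrative only.

```python
incident_cases = 1690        # cervical-cancer diagnoses quoted for the registry year
share_aged_18_to_64 = 0.80   # "about 80 percent" were 18 to 64 at diagnosis
share_enrolled = 0.21        # "about 21 percent" of those were enrolled in Medicaid

# Restrict to the under-65 group, since older patients appear in Medicare.
under_65 = incident_cases * share_aged_18_to_64

# Approximate number of under-65 cases that a registry-to-Medicaid link would yield.
linked = under_65 * share_enrolled

print(round(under_65), round(linked))
```

With these rounded shares the under-65 group comes to roughly 1,350 cases and the linked group to somewhat under 300.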
You know, we think that 21 percent of the population of a state who has a
particular cancer or 34 percent is a meaningful fraction and that figuring out
how these people are cared for and whether they are getting antecedent
appropriate care and appropriate care subsequent to diagnosis is important and
that these kinds of linkages could be leveraged further.
A lot of problems were encountered. So for the example of Medicaid, we have
big problems figuring out the duration of Medicaid enrollment. So we had
enrollment files for two years, and we looked over two years – in this case,
’97 and ’98 – only about half of the cohorts were enrolled for the
whole 24-month period, and a lot of people were in and out, and the way these
denominator files are maintained is very confusing.
We had about 74 percent of patients who were enrolled during the month of
diagnosis. Some were first enrolled after and some before.
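The enrollment questions raised here (enrolled for the whole period, enrolled in the month of diagnosis, first enrolled after diagnosis) can be sketched as follows. This assumes, purely for illustration, that each patient's enrollment has already been reduced to a set of covered month indexes; the actual layout of the Medi-Cal denominator files is not described in the talk, and `enrollment_summary` is a hypothetical helper.

```python
def enrollment_summary(months_enrolled, diagnosis_month, window):
    """Summarize one patient's coverage over a window of months.

    months_enrolled: set of 0-based month indexes with Medicaid coverage.
    diagnosis_month: 0-based index of the month of diagnosis.
    window: number of months of enrollment data available (24 in the example).
    """
    window_months = set(range(window))
    covered = months_enrolled & window_months
    return {
        "continuous": covered == window_months,
        "enrolled_at_diagnosis": diagnosis_month in covered,
        "first_enrolled_after_diagnosis": bool(covered) and min(covered) > diagnosis_month,
        "months_covered": len(covered),
    }


# Invented patients: always enrolled, enrolled only after diagnosis, and dropped out early.
always = enrollment_summary(set(range(24)), diagnosis_month=14, window=24)
after = enrollment_summary(set(range(15, 24)), diagnosis_month=14, window=24)
lapsed = enrollment_summary(set(range(0, 11)), diagnosis_month=14, window=24)
print(always, after, lapsed)
```

Retroactive enrollment and spend-down cases would need extra rules that this sketch does not attempt.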
When we looked inside the claims to see how often a diagnostic and
procedure code inside the claim corroborated the diagnosis that we found in the
California Cancer Registry, the answer was about 80 percent of the time. We had
access to a year of data. So if we looked over the whole year, it was about 70
percent. If we restricted the analysis to people diagnosed in the first half of
the year, it was 80 percent, because it can take a while for these codes to
catch up and appear in the claims.
Is this perfect? No. Is this useful? Absolutely.
So SEER-Medicaid, we attempted a link in California. It took us two years
more to obtain the data sets.
The denominator file structure limits our ability to identify cohorts of
the chronically poor easily. Challenges are retroactive enrollment, chronic
versus episodic poverty, spend-downs – that is, people for whom illness
precipitates enrollment – variation in state thresholds in generosity when we
try to do this with other states, and I won’t go into it, but, believe me, are
trying, and definition of an HMO or managed care. Just some of the challenges.
But we think that sort of a coordinated federal approach that is helping
will help the states and some of the states would undoubtedly be receptive.
This is only going to get done if federal agencies get involved.
So our wish list would be consistent definitions in the Medicaid enrollment
files – What does managed care mean? When are claims itemized? When aren’t
they? – Linkages of Medicaid data files to state discharge abstracts – So most
of the states maintain hospital discharge registries. You can’t link that
information – geocoding of where Medicaid beneficiaries reside. Linkage to
pharmacy data, and linkage to Census tract and socioeconomic variables.
Priorities. Coordination of procedures for obtaining access to data and the
review process – and I am not in favor of public use, but a series of hoops and
whether you have to actually go sit in Ann Arbor or you just have to describe
how secure your computers are, whatever. There need to be various stages and
Standardization of reporting rules. So SEER, which I work with a lot, has –
SEER-Medicare – you can’t put in any cell N less than or equal to five to
protect patient privacy, but other agencies have different rules and have to be
less than 10. Some standardization, and, I guess, harmonization would be
Anything the federal government can do to help us federate state data would
Making choropleth maps available for common tasks that we do again and
again in a common-base-type system, and working with the states to fulfill
analyses of Medicaid enrollment and claims files. Those really would be an
end-user’s wish list.
Very ambitious. You can ask, right? So that’s – I’ll stop there.
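The difference between the reporting rules mentioned above (one rule hiding cells of five or fewer, another hiding cells below 10) can be illustrated with a small sketch. The counts are invented, `suppress_small_cells` is a hypothetical helper, and real agency rules also cover matters such as complementary suppression that are not modeled here.

```python
def suppress_small_cells(counts, max_suppressed):
    """Return a copy of the table with small cells masked as None.

    A cell is masked when its count is less than or equal to max_suppressed.
    """
    return {cell: (None if n <= max_suppressed else n) for cell, n in counts.items()}


# Invented counts by age group, for illustration only.
counts = {"65-69": 42, "70-74": 11, "75-79": 7, "80+": 4}

rule_five = suppress_small_cells(counts, max_suppressed=5)   # "less than or equal to five"
rule_ten = suppress_small_cells(counts, max_suppressed=9)    # "less than 10"
print(rule_five)
print(rule_ten)
```

The same table loses one cell under the first rule and two under the second, which is the kind of inconsistency that standardization would remove.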
MR. LOCALIO: Thank you both for comments. I certainly can relate to both of
you in your tasks.
Heather, I just want to say to you that your particular point about having
access to data and why you need access to data is something we do understand.
It is something that I have raised previously at our committee meetings and our
subcommittee meetings, although, in terms that are somewhat more colorful.
And I do want to stress a point that you made, that we have been talking
about technical issues of access. We have been talking about protecting
privacy, but the point that you brought up is there is the other issue, that it
is important for organizations, other than government, to have access to data,
because there are opinions other than the government’s opinion about data. Let
the data speak for themselves.
Unfortunately, people who work in government are not always free to say
what they want and report what they want, and the example that I have raised at
committee meetings before has to do with HHS, has to do with Medicare Part D,
has to do with an analyst named Foster, who wanted to release information about
the cost of Medicare Part D, and he was told by Mr. Scully(?) that he was going
to be fired if he did. So he did not release those data. That is well known. I
got that information from the press.
Now, I, on the other hand, if I had that information, I would have just
said, this is it, and nobody would have done anything to my job. In fact, I am
encouraged to report things like that if they are of interest.
Now, you may have a particular point of view in your organization, but
there are others that you mentioned on the opposite end of the political
spectrum that have their points of view, but I think we have to stress that it
is important in this entire discussion to know that the data need to be told.
They need to be told, and even though we have to bear in mind it is not just
researchers getting access to data. It is people getting access to data, so
that they can let the data speak, and I just want to emphasize: do not feel
that your point has not already been raised or has not been heard in this
DR. STEINWACHS: Probably my education, Deb, choropleth?
DR. SCHRAG: Oh –
DR. STEINWACHS: I thought this might have some surgical, medical procedure,
and then I decided, no, it didn’t quite sound like it, because –
DR. SCHRAG: No. It is the technical term for – You have all seen them. You
see them in the newspaper all the time. They are, essentially – I wish I could
explain the derivation of the word. Unfortunately, I can’t.
Essentially, they are those maps that you look at that have, typically –
they can get very fancy, but they typically have population density. So, for
example, you might take all the ZIP codes in the United States and rank them by
anything. It could be number of graduates from high school. It could be number
of foreign-born persons in the ZIP code, the Census-level data. Right? So you
essentially – and you can get fancy, so you can look at the relationship
between the number of foreign-born persons and the incidence of stomach cancer.
You can look at the relationship between not speaking English in the home and
mortality from a particular cancer, engagement in a particular – Right?
And, sometimes, the maps are shaded, you know. So, typically, they are
shaded red to blue, and then they have dots on them. Those are choropleth
maps, and standard data files are really helpful to make them.
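(Editorial illustration, not part of the spoken remarks: the ranking-and-shading step described above can be sketched roughly as follows. The ZIP codes and values are invented, and the quantile rule is only one of several common ways to choose shading classes. Drawing the map itself would need boundary files and a mapping library, which are not shown.)

```python
from bisect import bisect_right


def quantile_breaks(values, n_classes):
    """Return n_classes - 1 cut points that split the values into
    groups of roughly equal size (a simple quantile classification)."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_classes] for k in range(1, n_classes)]


def shade_classes(area_values, n_classes=4):
    """Map each area to a shading class from 0 (lowest) to n_classes - 1 (highest)."""
    breaks = quantile_breaks(area_values.values(), n_classes)
    return {area: bisect_right(breaks, value) for area, value in area_values.items()}


# Hypothetical share of foreign-born residents by (made-up) ZIP code.
foreign_born_share = {
    "00001": 0.02, "00002": 0.11, "00003": 0.25, "00004": 0.07,
    "00005": 0.40, "00006": 0.15, "00007": 0.31, "00008": 0.05,
}

classes = shade_classes(foreign_born_share)
for zip_code, shade in sorted(classes.items()):
    print(zip_code, shade)
```

Each class would then be assigned one color on the red-to-blue scale. A second variable, such as cancer incidence, could be binned the same way and compared area by area.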
DR. STEINWACHS: Helping my education.
Let me take you to another thing. You raised the idea of having data
clearinghouses, and there is a kind of function in the private sector that many
times corporations use where their claims data, pharmacy data and so on go into
a place that standardizes it, makes it into something that is analyzable. CMS,
in the past, did some things with Medicaid data, in the old days –
DR. SCHRAG: ResDAC. ResDAC. We work with ResDAC, which is a research data
clearinghouse center. Maybe Gerry could probably talk about it, but they don’t
have everything that we want. Okay? We love ResDAC. They are on our speed dial,
but – We know them by name. They know us, but there is a lot that they don’t
have. They are chronically under-funded, et cetera, et cetera.
DR. STEINWACHS: So I guess there were – and you are getting at it – sort of
two questions in here.
One is what would you see a clearinghouse doing? And the reason I was –
Sometimes, it is just you have the data there. The other is you actually make
the data more useable, and so one of the issues across the states on Medicaid
data – and you were pointing this out and so on – is that –
DR. SCHRAG: What would you do?
DR. STEINWACHS: – make it more useable for researchers, there may be a
variety of things you would do that is not necessary for the state.
So maybe getting both of you to sort of comment, what would a clearinghouse
do, and are there some examples that you think are ones that could be looked at
DR. SCHRAG: So I’ll give you a specific example of the kinds of things
ResDAC, which already does a good job, could do even better if they had a
broader mandate – I think they just need a broader mandate – is they could say,
If you want access to the Medicaid enrollment files, these are the hoops you
want to – you need to jump through. If you – an anonymized version of that. If
you want the actual unencrypted data file, you need to jump through a few more
hoops. If you want the enrollment files plus the claims files to figure out who
had obesity surgery, you need a few more hoops. Really laying that out in a
very clear way.
So one hoop just to get the enrollment file. That way, I can figure out how
many poor people there are in a particular area. Maybe that is all I need.
A few more hoops, a little bit more difficult, and I do think you need to
set up barriers in researchers’ ways, so that they are really clear about what
they are going to do with the data and what they need it for.
If you make it too easy for me to get it, I’ll just say, Give me
everything, when I often don’t need everything. Sometimes, I just need little
bits of it. So they can really help researchers figure out what it is that they
need at what level and setting up a progressive gradient of barriers.
And ResDAC does some of it, but I think that they could do more. Gerry
MR. RILEY: (Off mike) – that Dave Gibson talked about yesterday, and they
are trying to get the data assembled into a format that is easier for
researchers to use, to pre-identify people with certain kinds of conditions –
searching through the claims forever to sort of identify people with common
conditions and things like that. So that might be one example of where this
database will go beyond what ResDAC normally does for people.
As far as the Medicaid data goes, there has been a great deal of effort put
into just trying to get very basic – you know. So I haven’t worked with
Medicaid myself, but there’s been a lot of work in our office, particularly by
Dave Bah(?) and other people to try and just get consistent measures of
enrollment and claims data and so forth, and they made a great deal of
progress, but I think the data are just starting to be used now on a much more
wide-scale basis than they have in the past. Can’t speak too much about that.
MS. BOUSHEY: I would just add one quick comment. I mean, I think, in terms
of many of the surveys that we use, using the administrative data capacity to
correct and amend many of the program participation elements would be
wonderful, and I could imagine you could do that without – one could imagine
just doing that to the public-use files, rather than – and making those
available in the same way that you do now, rather than – and sort of
eliminating that step, so that researchers didn’t have to, which then means we
never have to see the Social Security numbers or whatever.
MR. PETSKA: Can I just comment on that?
I thought some of that is done already using the tax data to edit the CPS
and the SIPP, which would go into public-use files and so on. I believe that
has already been done.
DR. STEUERLE: Just a quick comment and a question. Now, my 2-1/2 minutes is
down to 1-1/2.
Heather, I just want to say I – like Russell, I fully identify with your
statements about timeliness, particularly in the policy process. I mean, so
much research is oriented towards developing the status of things two years ago
or three years ago when Congress is constantly in the midst of making changes
that have enormous impacts, minor – not a minor example – an example being the
recent drug benefit that spent over – depending on how you do your present
value calculations, over $1 trillion, and often with almost no data input
relative to even some of the other things research.
But I guess there is one gap in our data that has always bothered me for a
long time. Deb, you are the only – really the only speaker who has referred to
it, and so I am going to put this question to you and you may not have an
answer. You can tell us afterwards, that is to do with trying to integrate in
the provider data.
Because my background is largely in areas like budget, I know the way or at
least I study a lot the way that economic systems work, and I know that part of
what goes on has to do with the cost of these systems, and part of what you are
talking about in the way some parts of the country provide benefits, some
don’t, largely relate to cost, and, sometimes, even the incomes within those
particular geographic communities.
But an important side of this is, if we don’t get at some of who is getting
these benefits, these costs, we are not very far in our – While we do develop
modest data on expenditures and what we are buying, we develop almost no data
on who is getting the money.
So, for instance, HHS doesn’t even do what I would call simple – not so
simple – quite a few people who do it – an input-output analysis, so when costs
rise by 10 percent, we know who is getting it. Is it 20 percent more to doctors
or do we have 10 percent more doctors and 20 percent more practical nurses, and
how much of it is going for administrative costs? How much is going to the
And we don’t even have that, and you are the only one who really mentioned
the provider data, and I am just curious whether you have any suggestions for
ways of getting at some of the cost side of this equation by looking at the
DR. SCHRAG: I think you can get at the cost side, because you have the
UPINs and you know which UPINs are doing what to whom. So I think that you can
If you want to look at things like physician income, that you need separate
survey data, but, procedurally, in terms of access to data, the biggest thing
that I think needs to be fixed and doesn’t seem to me to be that complicated to
fix is that detailed information about providers is not within the government –
the government doesn’t have that organized at all.
The AMA makes, I think, $50 million, some humongous number, and most of it
is sold to private companies, but they sell the AMA database, and people use
that AMA database, which profiles all physicians in the United States, and they
sell it again and again and again.
The government needs to do it itself. The states know who is a physician
and what their characteristics are, but the federal government doesn’t – I
don’t know. It doesn’t suck up that information from the states. Maybe there
are complicated reasons why not, but that would really help a lot, and that
would be a good place to start, and I think it would help the states, because
they have interests in these kinds of data for fraud and other things that they
really care about, and cost-related issues.
DR. SCANLON: We don’t know enough about providers. We don’t know enough
about cost, but we do know an awful lot about both. I mean, and maybe the
analysis doesn’t get presented sort of widely enough, but it is known. I mean –
what Russell mentioned, in terms of Rick Foster, the actuary’s office really
does a lot of work on the issue of cost, and we do know sort of where Medicare
costs are going. We do have the Medicare Payment Advisory Commission, which is
looking sort of at the cost reports that the providers file, and, in fact,
this, in some respects is much better data than what the private sector will
You do not want to know what the VHA response rates to particular items on
their surveys are, okay? As opposed to, if you are a Medicare-participating
hospital, you have to turn in your clinical report.
This is the kind of thing, I think, where we need to move incredibly sort
of forward in terms of improving the comprehensiveness and the quality of the
data we have, but it is not that we have been sort of completely sort of static
here or sort of – or have ignored the problems. We really – we know a lot.
There is a large group sort of, over at GAO. There is a large group sort of
in the Office of the Actuary. There is a group sort of at MedPAC. That are all
doing these kinds of analyses of exactly what I think you are talking about.
And the idea of using proprietary data, these organizations don’t necessarily
want to share sort of the data at the level at which we would like to have to
be able to assess their quality and be able to be confident about them.
One of the things that I worry about here – and this is a democracy, and,
as Russell said, we would like to sort of have the information out there.
Information can be used for good and bad purposes – okay? – and I think one
of the things that people within federal agencies probably would think about in
terms of release of data — is this going to potentially cause harm, because it
is not very good data? Would they release data where there was only a
30-percent response rate on their survey? That would be astounding.
When I was at GAO, our response rates, the requirements were that we would
be in the 60, 70, 80 percent range before we would use a number. That is not
true of data that are coming out of private surveys. Private surveys, they are
happy to be able to say, We did a survey. Here are the results. Okay? And,
then, they can turn around and sell those because there is a market for them.
DR. SCHRAG: But the point is that the states – I completely agree with
you that the private data – The AMA data are terrible. Absolutely. Nobody
should use it. The states have good data. It is just not accessible.
DR. SCANLON: You are right. It is not accessible, and it is potentially not
uniform. I mean, that is the other key thing about starting down the path of
saying we are going to go to the states. We’ve got the 50 states plus the
District of Columbia. If we are talking about Medicaid, we’ve also got five
territories that run programs. Try and assemble a consistent database from
those. It is incredibly challenging.
MS. TUREK: That is easy compared to TANF, or, in some states, the data is
collected at the county level.
MS. BOUSHEY: Yes, or childcare subsidies.
MS. TUREK: What?
MS. BOUSHEY: Childcare issues, county level.
MS. TUREK: Thank you all very much. I imagine after our last session we can
continue talking about users’ needs forever.
Anyway, for now, everybody go and have a good lunch.
(Whereupon, the workshop recessed for lunch.)
A F T E R N O O N S E S S I O N (1:18 P.M.)
DR. SCHEUREN: I am going to talk about the past because I am an advocate of
Deming, and Deming says you should only talk about something you know about.
Well, I don’t know about the future, and I probably will know it when I see it,
but then it’ll be the past. So I will talk a little bit about the past.
This is a very interesting subject, record linkage. Deeply connected to
health, quite expanded as these days were on other topics. Very deeply
important subject, and so let’s see how people thought about it 40, 50, 60
years ago or more. I call it the Once and Future King. So you may have seen
that line before. I don’t know who – Somebody may have used that.
The Book of Life. This is a concept that we are putting together a book of
people’s lives. That is what we do with linkage. It could be contemporaneously
this way, but it also could be this way, and all of those variations in
dimensions have occurred in these meetings. Great idea.
Started out – Dunne(?), I think is the one who used this phrase. Started
out right around the ‘30s and ‘40s that this phrase began to appear,
and just as we had the ability to do the kind of large-scale record linkage
that we all now do.
I am going to put this one up. This is the Chalk River Nuclear Power Plant
in Canada. We were talking about Gil Beebe(?) a little while ago, with Nancy.
Gil was an advocate of record linkage, because he’s an epidemiologist. This is
really where things started, and Howard Newcombe, who did an awful lot of the
fundamental work on linkage worked at Chalk River and it was all
Social Security did a massive amount of linkages, epidemiological linkages,
to look for various carcinogens in various industrial processes. A massive
amount of work done. Great stuff. Not talked about anymore. Great stuff. Still
can be done, I believe.
Most of our problems of that kind have been seen and acted on. We don’t
have the kind of problems they have in China when they do a cancer map in
China. They find that most of the cancer problems in China have to do with
differences in the way food is prepared in China. So we don’t have that
problem. We all eat McDonald’s, which is to say we all have a higher level, but
The theory showed up with Ivan Fellegi. Howard and Ivan are Canadians. It
is not an accident that I am going to talk about Canada. An awful lot of the
best work on record linkage has been done at Statistics Canada. Fantastic
tradition. Ivan is still Chief Statistician at Statistics Canada. Great man.
And we ran a conference in the names of these two people here in Washington
in 1997 on linkage, an international conference.
We had done an earlier conference, which I think is the – one of the things
that Jean associates with me. I was organizing both these conferences, was in
‘85 – which is more a local conference on record linkage.
Both of those are at the FCSM website, but I think they are huge PDFs. You
have to really be serious if you want them, but there are people here – Listen,
at this time of the day, on the second day, I figured there’s six people here.
I was wrong. Now, don’t all leave. Wait for him.
This wonderful piece of work by Marks – that is the same Marks. Karol
Krotki and Bill Seltzer. Bill Seltzer is still with us. He was the Chief
Statistician at the UN for a long time. Now, he does a lot of important work in
other areas, including human rights.
And here is a tremendous book, Bishop, Fienberg and Holland. These are all
contingency-table books. They are really valuable, and they are worth knowing
about in order to understand the error patterns in the data sources you are
using. If you do not understand the error patterns – non-sampling error
patterns – you really haven’t grasped the whole thing.
When you are doing linkage, you actually really can improve considerably
the quality of your data in many, many ways, and that has been talked about
quite well here, but there is no free lunch, not that there are many
economists in the room. Gene took his badge and put it away today. He is not an
economist today, right, Gene?
One of the things you can do with these systems is, if you have three or more,
you can do multiple systems estimation. If you have two, you are doing dual
systems estimation. This is the traditional pattern that has been used around
the world in evaluating censuses. It is very model dependent, and the
Bureau has moved carefully away from that towards multiple systems, okay? Very
carefully. I wish they went a little bit faster, but, anyway, very important
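(Editorial illustration, not part of the spoken remarks: a minimal sketch of the two-list case, using the classical Lincoln-Petersen form of dual systems estimation. The counts are invented. The estimator assumes the two lists capture people independently and that records are linked without error, which is one reason the approach is described as model dependent.)

```python
def dual_system_estimate(n_first, n_second, n_matched):
    """Lincoln-Petersen estimate of total population size from two lists.

    n_first   -- people found on the first list (for example, a census)
    n_second  -- people found on the second list (for example, a follow-up survey)
    n_matched -- people linked to both lists
    """
    if n_matched <= 0:
        raise ValueError("at least one matched record is required")
    return n_first * n_second / n_matched


# Hypothetical counts: 900 on list one, 500 on list two, 450 linked to both.
estimate = dual_system_estimate(900, 500, 450)

# If the linkage step misses 10 percent of the true matches (405 instead of
# 450), the same formula overstates the population.
estimate_with_missed_links = dual_system_estimate(900, 500, 405)

print(estimate)
print(round(estimate_with_missed_links, 1))
```

The second call shows why linkage quality matters here: every missed link shrinks the denominator and pushes the estimate upward. With three or more lists, dependence between lists can be modeled rather than assumed away, typically with log-linear models for the incomplete contingency table.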
What about content – A lot of work has been talked about here about that. I
am going to give some names to you. This is Mitsuo Ono, who used to be the head
of the Income Branch at the Census Bureau. He did some very important early
work on matching income-tax returns to the CPS, okay. I think it was the 1970
CPS. Yes. I think it was – Maybe it was ‘72. I thought it was the ‘70
Dorothy Rice. If you don’t know Dorothy Rice, you better leave right now.
Just leave right now. Major hero of mine. She used to be at Social Security.
That is when I used to work for her, and then she was the head of the National
Center for Health Statistics. She is a tremendous individual. A great force.
I interviewed her a year ago. The article appears in the September issue of
AMSTAT(?) a year ago. Not this – well, it is two years ago now. This is
September. Two years ago. Worth reading. I interviewed her, but listen to what
she said. Don’t pay attention to what I said.
Joe Steinberg is the person who started the linkages at Social Security
that we have been talking about the last few days, including getting the Social
Security number question on the October 1962 Current Population Survey. That is
when it was first put on the survey. Of course, it is not on there anymore. It
was taken off. He did a great deal of work, very good work. I came on after him
and tried to finish what he did. He went on to be Assistant Commissioner at the
And Ben Bridges was my boss at Social Security. I needed to put his name on
here because he used to invite me to his house a lot. So good guy. Really good
guy. Well, he’s a really good guy, and he had the patience to read everything I
wrote and fix it. Since then, it has been bad.
Then we ended up producing a whole series of products called the
Interagency Data Linkage Series. Very dated now, but full of mathematical and
statistical ideas that are still valuable and have not been recaptured anywhere
else. I am sort of proud of that, pleased with it.
There are some other things that happened as a result of that work.
I am going a little too slowly. I can tell that from the three people in
the back who have woken up now and said, Where is the next speaker.
One of the key champions of augmenting survey data with administrative
records is Gene Steuerle who is here. Okay?
Another one is Howard – Howard Iams. I hope you listened to what he said
yesterday. Dead on. Dead on. Dead on.
And Julia Lane. I happened to come in – I know Julia from other worlds,
when she tried to recapture the essence of the Continuous Work History Sample
state by state. Amazing individual. She gave a great talk yesterday.
Two more things. Gene Rogot, whom you don’t know. He was at the National –
PARTICIPANT: (Off mike).
DR. SCHEUREN: Pardon me?
PARTICIPANT: (Off mike).
DR. SCHEUREN: Yes, he convinced people at NCHS and the Census Bureau to
match the CPSs to the National Death Index. I asked yesterday if that was
continuing. It is continuing. Okay? That is a really interesting process, and
that has been published. You published it, didn’t you? It was published. A
Million Deaths. It was a publication a few years ago.
It is a marvelous piece of work, because you can look at socioeconomic
differentials in mortality with that, which is my interest in mortality, by the
way, socioeconomic differentials. I am not going to go down that road today,
because that is just too much fun.
And Joe Pechman, a lot of people know Joe here, at least Gene does. The
idea of – matching data when you can’t do it exactly is a valuable idea. It is
a heuristic that I urge you to use with great caution. I have written
considerably – I know. I have written considerably on its weaknesses. Okay? It
is not always weak, but if you are desperate, do it. If you are not desperate,
wait for the real thing.
Let’s talk about optimizing of systems. That is – What we have been doing
is taking the existing systems and making them better, but what if we were
optimizing systems, what would we do?
Well, this is not a real long list. What we want to do is we want to
prevent the survey errors from occurring – all right? – to begin with, if we
can. Very hard in a day when non-response – item and unit non-response are so
high. We want to build in a detection system so we know there is an error and
we need to fix them – repair them, and, of course, one of the best ways to do
those – the last step is to replace them with data from a better source. So the
linkage is an enormously important quality-improvement step.
Now, it was said by somebody this morning the quality of the thing you link
to may not be perfect, okay? It certainly wouldn’t be, anything I have ever
One of the things that’s going on is that there is a tradeoff between
response variance and response bias. Administrative records are typically
biased. They are not measuring the right thing, and if you think it is
something that it isn’t, you’ve got a bias, even if it was perfect – okay? –
but they do get rid of a lot of the response variance, which is very
characteristic of surveys.
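(Editorial illustration, not part of the spoken remarks: the tradeoff between response variance and response bias can be made concrete with the standard decomposition of mean squared error into squared bias plus variance. All numbers below are invented. The "survey" reports are centered on the true value but noisy, and the "administrative" values are stable but measure a slightly different concept.)

```python
from statistics import mean, pvariance


def error_summary(reported, truth):
    """Return (bias, variance, mean squared error) of reported values around a
    known true value, using MSE = bias**2 + variance."""
    errors = [value - truth for value in reported]
    bias = mean(errors)
    variance = pvariance(errors)
    return bias, variance, bias ** 2 + variance


TRUE_INCOME = 100  # hypothetical true value, in arbitrary units

# Hypothetical survey answers: right on average, but heavily rounded and noisy.
survey_reports = [80, 120, 95, 105, 70, 130]

# Hypothetical administrative values: consistent, but systematically low
# because the record captures a narrower concept than the one of interest.
admin_records = [94, 95, 96, 95, 94, 96]

survey_bias, survey_var, survey_mse = error_summary(survey_reports, TRUE_INCOME)
admin_bias, admin_var, admin_mse = error_summary(admin_records, TRUE_INCOME)

print("survey:", survey_bias, round(survey_var, 1), round(survey_mse, 1))
print("admin: ", admin_bias, round(admin_var, 1), round(admin_mse, 1))
```

In this made-up case the administrative source has the smaller total error despite its bias. With a larger conceptual mismatch the comparison could go the other way, which is the caution raised in the remarks.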
If you have ever done any linkage yourselves and you have compared
essentially equivalent – never exactly – equivalent concept from an
administrative source or an operating source in a more general world, which is
the one I am in now, with a survey, you see this enormous variation in the
survey results. Rounding errors. All kinds of things going on in the data.
Maybe the right signal is in there, but an awful lot of noise.
Playing with Matches is a book. This is a plug. Three people in the back,
the one who has fallen asleep, you know, you should wake up now, because you
have to buy this book when it comes out. It’ll come out next year. Okay?
It is about data quality. It talks about all the traditional ways we have
looked at data quality. Everyone in this room who has done – handled data has
used these techniques in various ways, and it talks about linkage, all of the
aspects of linkage, most of which have not been talked about today, but,
fundamentally, linkage, in my opinion, is really to replace one data source
with a better one. Okay?
If you want to study error patterns, that is good. That was done a lot. I
don’t think that got us very far, frankly. What got us far was replacing bad
data with good data or better data. There is no good data.
Okay. A couple of more slides. I guess I got four. Three. I’ll make it
Privacy and confidentiality. One of the big problems if you are in an
administrative agency or a statistical agency is you really don’t understand
the language the same way. There is a conflict of principles, really, between
If you are at the Census Bureau or NCHS, you are a part of a culture of
confidentiality. If you are at the IRS or Social Security – Social Security is
a bit on both sides – you have the culture of privacy. You focus on the privacy
of the person. You are – that person’s data is sacred to you.
Those two values do have an intersection, but it is sometimes very hard to
find, depending on the setting you are in.
I think more work like this meeting, more joint work, designed work for
joint goals can help deal with some of this, but it has existed for all the
decades that I know about and have heard about in all of these different
processes, and I don’t think it is going to go away any time soon.
I want to make a comment about – I am using an industry coding example. The
statistical agencies say to the administrative agencies, We can’t give you back
the data after we clean it, okay? Well, because we would violate the trust we
made, and so I think, though, the statistical agencies need to look at the point
of intervention where they get the data, and look at whether they could get the
data at a different point and thereby aid the administrative agency – and the
industry coding example, which I am not going to cover, is a perfect example of
a great deal of waste with Census coding things and the IRS coding things and
the BLS coding things and the states coding things – okay? – that are none of
them done well. Okay? All of them might be done better, if we were to fix the
Legal and bureaucratic. Lot of discussion about law and practice links. You
are all discouraged by it. Get a lawyer in the room. I mean, get some lawyers
here. Really – need to do this again, and experiment, and I believe in the need
to do continuous measurement of what is going on.
That is one of the reasons I really like this group. I didn’t realize that
you don’t meet as much as you might, and, of course, I hope – I mean, there’s
great ideas here these last two days. I hope you are stealing each other’s
practice, best practice. Don’t steal each other’s worst practice. Steal each
other’s best practice. I don’t think I have to tell you that, but, sometimes,
you are not necessarily sure you know where it is.
One of the things that NCHS does, which I absolutely think is fantastic, is
they have an IRB. All right? Every agency here who does linkage should have an
IRB. If they don’t, there is a real issue there. Okay? I really, really think
that is what we should do, and I am not going to name names, because I know
some of the agencies here who don’t, but they should. It is very important that
you be held accountable by your peers, okay. Should be held accountable by
other stakeholders, too, but by your peers, because your peers can help you fix
I have been subjected to IRBs in lots of settings, in private settings.
Somebody was talking about using the National Survey of America’s Families this
morning, which is a survey I worked on for a couple of years. Doesn’t do any
linkage, of course. One of the problems with it.
Let’s talk about learning linkages. Let’s think about – it as a learning
system. We have been doing it forever, but we haven’t thought about it as a
learning system. We could have. Just didn’t.
We need to continue these conferences. Fundamentally, keep talking. Keep
listening. Collect and publish a summary. I don’t mean 100 pages. Forget that.
Two pages. Five key points. Okay? Four contacts with people. Okay? Really.
If you got into somebody’s remarks the last two days, talk to them on the
phone. Get going, okay? Only do the things you are interested in. Don’t do
anything else. Doesn’t matter. There’s enough interest in this room so a lot of
good things will happen.
I have said that.
I want to see diagnostics developed for linkage. I am a big fan of
diagnostics. I have learned a lot about diagnostics from regression. Many of
you who are economists do regression, log-linear, logistic regression, if you
are in epidemiology world or standard regression if you are in some of the
other worlds in this room.
Build diagnostics. We need to do this. This is very important. Not hard.
Fun, actually, and then you can get rid of some of the other errors in here and
get some of the misleading things out.
And here is one that you won’t agree with. I put match in first, because I
wanted you to find people at equivalent levels and swap staff. Two, three
months. Right? As long as you can stand it in another agency. It is
fundamental, really fundamental. We are not learning fast enough. This is a
shame. All these smart people and we are not – We are all living in our various
stovepipes – okay? – smoking something. No. No. No. Wrong generation. You never
did that, did you?
Okay. What’s happened here? Somebody has taken over here my computer. They
say, End of show, here. That is what they told me, End of show. I am almost
Thanks for the memories. I have gone back to something I used to do, and
some of it I still do, and best of fun on our road. I am saying – including
myself – our road ahead.
DR. STEINWACHS: Thank you very much.
DR. SCHEUREN: You are welcome. Sorry for too much past.
DR. STEINWACHS: So, Mike, did Fritz set the stage for you?
DR. DAVERN: He certainly did. Yes, it is very hard to follow Fritz, of
course. Now, everybody can really go back to sleep back there. Right.
He had most of what I had to say here. I’ll give some examples, I suppose,
or some of what I had to say. I don’t really know the names or the history
quite as well. I almost left the room, unfortunately, when he said, If you
don’t know this person, why don’t you leave the room?
DR. SCHEUREN: Dorothy Rice?
DR. DAVERN: Yes, I know –
DR. SCHEUREN: You know Dorothy.
DR. DAVERN: So, first of all, I would like to thank Joan for inviting me
here to be a part of this. I think it is really important, and I am looking
forward to giving you my thoughts after a couple of days here or at the end of
it, and I get to be the last speaker, so everybody is eagerly anticipating the
last slide. So I have it duly marked, so you’ll know when it is coming.
Basically, administrative data and survey data are really kind of collected
for different purposes, right? We all know this. Survey data are collected for
research, for the most part, as research files; administrative data are
collected to administer a program, and we want to put these two things that are
sort of at odds together to do really good health research, health-outcomes
research.
So I am going to do some ramblings and musings. I don’t really know much
about administrative data, other than what I have learned from people who are
in this room. I know Dave Ball was here yesterday. He taught me an awful lot
about administrative data from – on the Medicaid side, and so if I get
something wrong, feel free to correct me, but I am just going to give you my
impressions of what is going on.
So I am going to stick with what I know, which is survey data, and I know
survey data fairly well and have been working with it for a long time, and, then, I
am going to see how administrative data is sort of like survey data, in some
ways, and see if that can be a useful exercise, and talk about much of what –
People have already brought up a lot of the issues I am going to talk about.
But here is what I have: I have several concerns with survey data for
health research, and then I am wondering how administrative data compare on
these issues, and then I have issues in merging the two sets of data or
matching the two sets of data, and then work left to do to fulfill the great
potential, I think, of these merged and matched data sets.
I won’t spend much time on the data stewardship, privacy, confidentiality
stuff. It has been covered well, I think quite well, elsewhere, and it is
So I am going to start at the end just in case I don’t make it to the end.
This is what I want people to understand from at least my point of view.
There is great potential for health research to be done with these linked
survey and administrative data files.
Survey micro data are in the public domain, and that is really important.
Heather talked about it quite a bit here today, and the importance of having
that out in the public domain for policy research.
Having it in the public domain is also important because the
strengths, and especially the limitations, of these data are extremely well
known. Sometimes that is thought of as a weakness of the survey data, that we know
a lot about it and we know a lot about its limitations.
We don’t, unfortunately, have that same kind of information about the
administrative data, because, obviously, it is on – it is not in the public
domain, and researchers can’t do research on its quality.
So because these data are not in the public domain, it is really imperative
that the limitations be thoroughly investigated by the people who are entrusted
with these data, more so – You know, a lot of it is going on, but if we are
going to put these data sets together, we really need to have this information.
And there needs to be documentation and research on these linked files, in
other words, metadata – you know, data – information on the actual data
elements themselves, how they were collected, how they got in the data file,
where they came from, and a lot of information on the process of how that
information was produced. That needs to be put out into the public domain, so
if I am reviewing an article for a journal and someone is using this
linked-data file, I know how that variable was produced.
If I don’t know how that variable was produced in the administrative data
file, it makes it hard for me to review an article and know if that correlation
or regression coefficient they found is actual or just something that was
created as a part of the administrative process.
Certainly, NCHS, Census, NCI, Social Security, AHRQ, everybody here has
these agreements in place and are the logical people to start producing this work,
both the documentation and the research on the data itself. I think that is
We need to have that research done. We need to get all this information out
into the public domain, and these are the people who can do the work at the
moment, because it has taken a couple of years for our – the project I have
with Census to really hit the ground running, just because of all the
agreements that have to be in place, and it is impossible for researchers on
the outside to do this kind of work.
Survey data have extremely well-known limitations. Okay? Just to give you
some of the highlights – or lowlights, as you may see – the survey data concerns
that we have include sample frame coverage error. We have talked about that quite a
bit, and Fritz brought it back up again. We have sampling error and variance
estimation. You have non-response error, both item non-response and unit
non-response. It is becoming worse, certainly, with both of those.
We have measurement error, things like collecting data from mixed modes.
There are a lot of people who study whether a piece of information
collected through a self-administered questionnaire is different than if it
was collected through an interviewer.
They have data processing, imputation, editing, and there is always need for
better documentation of metadata on the survey data side of things. There is no
doubt about that.
And all these things, I think, are extremely well known about surveys.
Survey data are dirty, messy and not for the timid, and I highly recommend that
– you know.
So when I am talking about administrative data, as a survey researcher, I
know that survey data are messy. I have made a living off of pointing out to
people that the survey data are messy. It is something I publish quite a bit
on. So, in general, I think that knowing the survey data are messy is a good
And so how are the administrative data unlike or like survey data with
respect to these main issues that we are concerned with in survey data?
And there is, of course, a great variety of administrative data, and I am
kind of throwing it all in a bucket here at the moment.
Sample frame and frame coverage, not really a problem, obviously. Administrative
data, you know, cover the entire enrolled population.
You certainly need survey data, as pointed out, to know about the unenrolled
and potentially enrollable populations and take-up rates and all that kind of
stuff. So the survey data provides you with that, but there isn’t really much
of a problem from an administrative data point of view as far as a frame of the
population being covered.
It is important to note, though, that I did work – I was working with the
Veterans Administration in Minneapolis on a study that they were doing of
Post-Traumatic Stress Disorder, and I was doing a non-response analysis on
their survey, where they had sent out a questionnaire to people about PTSD, and
what we found was that we had a response bias when we looked at the
administrative data, that it seemed that people who didn’t have PTSD were much
less likely to respond. So we thought this was something quite interesting, and
then it turned out, when we dove into it, that it was the quality of the
contact information that was really producing this.
People who have PTSD received cash payments for having PTSD, and, as a
result, the quality of their contact information was very, very good. We had
good addresses. We had good phone numbers. We didn’t on the others, and that
was what explained the difference. There was very little difference after we
controlled for that factor.
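
The non-response check described here can be sketched in a few lines. This is a minimal illustration, not the analysis from the VA study: the cell counts are invented so that the pattern is easy to see, and the names `cells` and `response_rate` are assumptions introduced for the example. It compares the crude response rate by PTSD status with the rate within contact-quality strata.

```python
# Hypothetical counts for illustration only; not figures from the study.
# Each cell: (has_ptsd, good_contact_info, sampled, responded)
cells = [
    (True,  True,  900, 630),
    (True,  False, 100,  30),
    (False, True,  300, 210),
    (False, False, 700, 210),
]


def response_rate(rows):
    """Responded divided by sampled, pooled over the given cells."""
    sampled = sum(row[2] for row in rows)
    responded = sum(row[3] for row in rows)
    return responded / sampled


# Crude comparison: response rate by PTSD status, ignoring contact quality.
crude = {
    ptsd: response_rate([c for c in cells if c[0] == ptsd])
    for ptsd in (True, False)
}

# Stratified comparison: the same rates within each contact-quality group.
stratified = {
    (ptsd, contact): response_rate(
        [c for c in cells if c[0] == ptsd and c[1] == contact]
    )
    for ptsd in (True, False)
    for contact in (True, False)
}

print("crude:", crude)
print("stratified:", stratified)
```

In these made-up cells the crude gap between the two groups vanishes once the comparison is made within contact-quality strata, which is the stylized version of the pattern the speaker reports. Real data would rarely line up this cleanly.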
So when you are using administrative data as survey sampling
frames, as another way to mix the two, we need to be careful
of those kinds of things like contact information.
Sampling error, of course, is not much of a problem. Could be if you are
drawing samples from the administrative records to use for research, not a big
Non-response error, or missing data, is a bigger deal. Certainly, the
Medicaid data that I have been working with are what I know best. Item non-response on
those is a major issue, largely because, I think, item non-response here isn’t
the same as it is in surveys. The mechanism that produces it in surveys is
someone doesn’t give you a piece of information or they don’t know. So they
refuse or they don’t know, and you know that.
I think it is maybe more systematically missing in administrative data if
it is missing for some reason, and that is an important thing to keep in mind
when you are using these data for research purposes. It is probably more likely
to be not missing at random versus missing at random, in kind of the statistical-ese
of things.
Age, program codes, race and ethnicity, we have been back and forth with –
The CMS people have been wonderfully open about problems with their data with
us, and, as a result of that collaboration, we have learned a lot about the
data, but it is important to know that this is – it seems to largely be
systematically missing. You know, some states are missing race and ethnicity
information, and others are – have very good race and ethnicity information,
or, at least, it is filled – the variable values are filled in.
So some of this data is missing systematically, the TANF flag by county, and
we found out – we wanted to use a TANF flag on the MSIS, the Medicaid
Statistical Information System, and found out that it wasn’t really all that
good, because it was systematically missing. Race, ethnicity by state were
systematically missing in the MSIS, and some states had much more missing data
Identifying data can also – ID data can also be missing systematically,
which is really important for doing linking, obviously, and we need to really
do a good job and figure out where these data are missing in systematic ways,
and it can be a large source of sample loss for the merged data if ID data are
So administrative data have important information for health research that
is missing. I mean, that is the bottom line on missing data, and I think
it tends to be missing more systematically, perhaps, than in surveys, and so
maybe some of the techniques, like imputation, are not as feasible
with the administrative data.
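
One simple way to look for the systematic missingness described above is to tabulate the missing rate of a field by the unit that supplies it. The sketch below is only an illustration: the record layout, the field names `state` and `race`, and the helper `missing_rate_by_group` are all assumptions, and the records are placeholders rather than MSIS data.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_key, field):
    """Share of records in each group where `field` is absent, None or empty."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec.get(field) in (None, ""):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical enrollment records; states "A" and "B" are placeholders.
records = [
    {"state": "A", "race": None},
    {"state": "A", "race": ""},
    {"state": "A", "race": None},
    {"state": "A", "race": "white"},
    {"state": "B", "race": "black"},
    {"state": "B", "race": "white"},
    {"state": "B", "race": "asian"},
    {"state": "B", "race": "white"},
]

rates = missing_rate_by_group(records, "state", "race")
overall = sum(1 for r in records if r.get("race") in (None, "")) / len(records)
print(rates, overall)
```

An overall missing rate can look moderate while one group is almost entirely missing. That concentration is the signal that the data are unlikely to be missing at random, and that imputation borrowed from survey practice may not be appropriate.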
There are certainly measurement issues with administrative data. Certainly,
administrative data, as we heard from Social Security yesterday, are the
standard for knowing whether someone is enrolled in a program and how much
someone received in benefits. There is no doubt about that. That is right on.
However, there’s other administrative data that is desired for research
that is on these files that may not be as linked to the program as – and may
not be as well measured, and there’s probably a lot of error associated with
these things, as Fritz brought up and other people have talked about.
Administrative data can be collected through many modes during more than
one wave of interviewing with several instruments used, and it is all kind of
mushed back together, okay?
You have interview – and the survey researchers will tell you all this
stuff matters – okay – that they do a lot of research into interviewer effects
and into self-administered questionnaires versus interviewer-administered questionnaires,
and so you have all this stuff going on.
You have people completely filling out, where the interviewer actually
fills it out completely for the enrollee and then submits it and just has them
sign it, and, certainly, I have done that for tax information for people who
don’t speak English. I helped out quite a bit in doing that over the years, and
so I fill out the form completely for them and just have them sign it and send
it in, after they have given me whatever W2s that they have.
So you have all these kinds of things that are going on, and so it is
important to try to track that, as best we can, to try to figure out what the
source of the information is and create this metadata and do the analysis on
this kind of stuff.
And interviewers have a wide variety of training and skills. For example,
you can have a tax – If you have an accountant fill out your taxes or you do it
yourself or other kinds of things, there may be quality-of-data issues involved
with that, and so it is important to be able to – You know, when you are
beginning to link this stuff, it is important to know about that when you are
trying to use these data for research purposes.
Medicaid enrollment data can be drawn from a wide variety of sources,
including county level, state level. You know, it is coming from all over the
place, and you have no idea how that variable got to where it is when it is on
the MSIS. It has gone through a lot of hands before it is into that MSIS, and
it is really important for people to understand that that is an incredibly
different situation than when the Census Bureau goes out and collects a survey
with an English version of the instrument and a Spanish version of the
instrument and so those are things that could be going on that are being drawn
from all over.
Administrative data forms, the forms you actually fill out, are
generally not as user-friendly. I am always frustrated by them. That is one
thing that I think administrative people could learn – data people could learn
from survey people is how to get a form that is actually user-friendly and is
laid out in an easy way for people to see the race and ethnicity information, and
they can circle more than one or they can fill in the box for more than one, so
that it’ll be comparable to the survey data and things of that nature.
So research is really needed into the mode effects and longitudinal panel
conditioning. As data is collected over time and things change, instrumentation
effects and all that kind of stuff can certainly creep into the data and it is
something that I think we really need to go in and take a look at. Survey
research has a long history of this kind of work.
I know administrative data has done it, but the thing about survey data is
you have these wonderful journals, outlets that get out to the public. You have
Public Opinion Quarterly, Journal of Official Statistics, and this work gets
out there. You have all these – and, certainly, American Statistical
Association JSM meetings, you have all these – you know, a historical record of
the problems with these surveys that have been created, and it would be nice to
see – Certainly, some of that work is done on the administrative side, but
seeing this kind of work, looking at these kinds of questions, I don’t see very
often in the administrative data, and it would be interesting to see, looking
at like interviewer and mode effects and if there’s reasons to suspect that the
data may not be consistent.
Also, it is important to remember that people have different motivations
for filling out administrative data than survey data. Okay? I think that is
really key. You might want to have one income for your tax record. You might
report another one to your CPS interviewer. You might report another one to the
Medicaid agency, so you can get enrolled in Medicaid. There’s a variety of
these things. You can think of a creative caseworker in Medicaid being just
probably as good as a tax accountant at hiding income and knowing how to put
the family formation together, and so these are things we should be thinking
about as far as motivations of people for filling out these data when we begin
to look and see and cross classify by this stuff.
Also, is there data that is not accepted in some administrative data
systems, so that data-entry folks just enter something and pass by that screen, even though
they didn’t ask for or collect that information? So you are always curious about
that. So it is always good to check out if you can source out who put in that
piece of data and how it got into that system. It is really a key thing to be
able to do research on.
So that is it, and, also, data editing and imputation. This is something
that is absolutely essential, I think. When we are putting together these
linked-data files, we need to have incredible metadata on them for researchers
to be able to use them well. There is very little documentation in the public
domain regarding the collection, editing, imputation procedures of
administrative data and enrollment data relative to survey data, and I think
that that really needs to – If we are going to link these things up and create
these files, the first thing we need to do, for researchers to use them
effectively, is to write the documentation. I know it is not anybody’s idea of
What I am doing, as a research project with the Census Bureau, where we
want to get to the answer, but we have put together a huge team of 20-30
researchers who have put a lot of effort into a project over two years, and all
that will sort of be left by the wayside, all the knowledge we collected along
the way, and we’ll just get a research paper out of it that gives the technical
results of what we are looking at, but we won’t have created the metadata, I
think, at the end of it. This is what is typically left. I mean, as
researchers, we just want to get to the results, and we want to pass along –
and not do this really tough part of documenting and writing this stuff up.
So putting these linked-data files together also means we have to create
the metadata to go with them. I think that is essential and needs to be done,
and all this kind of research needs to be taken – take place so we can do that.
So, basically, how does administrative data compare to survey data for
research purposes? Survey data, micro data and research into critical sources
of error are all in the public domain. Survey data are very strong, because
there are so many known problems. Okay. I have already talked about that, but I
do think it is an extremely important point that is often missed by
researchers, and similar research needs to be done on administrative data.
Certainly, the quality of administrative data will vary greatly from
centralized data collections or more centralized, like SSA, IRS or Medicare to
Medicaid or state-based programs, and so, certainly, I have been throwing them
all into one pile, but I expect that there will be great variation with respect
to some of those things.
The issues with the linked-data files that we have certainly been dealing
with over the last two years in the project that Ron described yesterday with
Census are that there are universe issues and measurement error on both sides, the
administrative data side and the survey data side, and it is essential to
understand the differences and concordance between these data sources.
The universe issues, when there is missing linking information, that is not
good, right? So we need to really figure out why it is missing and who is
missing and how it could impact our analysis.
Do we have differential sample loss – missing ID – because someone refuses
to give their Social Security number on the CPS, for example, or could not – we
couldn’t find their Social Security number, so it couldn’t be validated? So we
have differential sample loss. There’s two real sources there. Ron showed it
was about 27 percent of the cases, I recall, total that couldn’t be linked.
Administrative data has missing linking key information, and it is
differential. He showed – I’ll show you here in a second that it was
differential, and it was systematic, and so there needs to be – when we build a
common universe, we have to do it carefully and figure out what was going on.
As you can see, here is a systematic. The red and black aren’t good for ID
This is one of those maps, I believe, that Deb was talking about earlier,
DR. SCHRAG: Yes.
DR. DAVERN: So here we are, and I actually didn’t know that was the name of
them either, but I have one in my presentation.
As you see, California and Montana, if you were doing an analysis, you
would have systematically missing data that could impact your analysis,
depending on how you are using those data. So it is key that you find out who
is missing and why when you are working with these linked-data files.
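
Finding out "who is missing and why" in a linked file can start with a tabulation like the one sketched here. Everything in it is hypothetical: the record layout, the status labels, the functions `link_status` and `loss_by_group`, and the placeholder states "X" and "Y". It is not the procedure used in the Census project, only an illustration of separating missing linking keys from failed matches and checking whether the loss is concentrated in particular states.

```python
from collections import Counter


def link_status(record, admin_ids):
    """Classify one survey record against a set of administrative IDs."""
    key = record.get("id")
    if key is None:
        return "no_key"      # ID refused or could not be validated
    if key in admin_ids:
        return "linked"
    return "no_match"        # usable key, but not found in the admin file


def loss_by_group(survey, admin_ids, group_key):
    """Share of survey records in each group that lack a usable linking key."""
    totals = Counter()
    lost = Counter()
    for rec in survey:
        totals[rec[group_key]] += 1
        if link_status(rec, admin_ids) == "no_key":
            lost[rec[group_key]] += 1
    return {group: lost[group] / totals[group] for group in totals}


# Hypothetical data: four administrative IDs and eight survey records.
admin_ids = {1, 2, 3, 4}
survey = [
    {"state": "X", "id": 1},
    {"state": "X", "id": 2},
    {"state": "X", "id": None},
    {"state": "X", "id": None},
    {"state": "Y", "id": 3},
    {"state": "Y", "id": 4},
    {"state": "Y", "id": 9},
    {"state": "Y", "id": None},
]

statuses = Counter(link_status(rec, admin_ids) for rec in survey)
loss = loss_by_group(survey, admin_ids, "state")
print(statuses, loss)
```

Keeping "no key" separate from "no match" matters because the two have different causes: the first is sample loss from refused or unvalidated identifiers, while the second may simply mean the person is not in the administrative universe.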
Developing these linked universes, there’s all kinds of reasons. This is a
slide Ron had, too. There’s not a valid record. They refused to have their data
linked, and then you have the big group of – Most people are in the big group
in both the MSIS universe and the CPS sampling-frame universe, but there is
also not a valid record. There’s people in group quarters. There’s people who
have died before the CPS interviewer gets to them, but they were enrolled in
Medicaid that year, and those kinds of things. Enrolled in more than one state
is another issue. Also, on the CPS side, you have births, people who were born
not in the calendar year that the data were collected for, but are included in
the CPS interview.
So you have measurement error here going on. There’s conceptual
differences. For example, a person can be on Medicaid, but not receiving full
benefits, so is the person really insured? That is an important question we
need to ask, and, in some cases, we determine that, yes, they kind of look like
they are getting a full range of benefits, some of these people who are on partial
benefits. Others aren’t. So you do have conceptual differences when you are
linking these files to think about. So they are on the MSIS, but do they
actually have health insurance as we think of it?
You have misreporting in surveys – person is on Medicaid, but reports some
other type of coverage or reports that they are uninsured, and, certainly, Ron
showed that yesterday.
You have misclassification of administrative data. Race data are often
missing from the Medicaid file, and are important for, of course, disparities
research, and when they are there, they may not be collected systematically in
every state the same way.
You also have systematically missing variables, as I have talked about.
So the potential for the merged data is great. We have talked about a lot
of these things. You have improving the accuracy of survey data using
enrollment data. You can improve the accuracy of sample frames.
One thing we haven’t talked about is the Census Master Address Files,
greatly improved by the delivery-sequence file, and, then, their relationship
with the U.S. Postal Service, and so those are great ways to improve Census’
sampling frames or anybody’s sampling frame is by looking to the administrative
Using merged data to create small area estimates was covered. Incredible
potential, I think, there.
Improving administrative data race and ethnicity information needs to be
done, especially for health disparities work.
There is great benefit to using information on imputation models and
editing from these linked-data sets that both the administrative data side and
the survey data side can use. So survey data can do better imputation and so
can the administrative data. Even if they can’t get the merged or linked data
back, they can learn a lot about their own data and the patterns of missing
So this stuff will greatly improve health policy simulation and health
research, and engage our errors. Don’t be afraid of them. Engage. Go out there.
Do the research and document them and try to get the best stuff out there.
And just to wrap this all up, there’s a couple of – We talked a lot about
problems. Other people have brought these up with recency and those kinds of
Certainly, these agreements all do their duty, and they restrict access,
and, as a result, we are dealing with old data, in a lot of respects. So that
is one of the limitations of this stuff. Hopefully, that will pick up and be a
little bit more timely.
Data are not in the public domain, and ability to conduct research into
quality of administrative data for research purposes is limited for people like
me, who like to do data-quality analysis.
So it is imperative that the agencies entrusted with those data really do a
good job of looking at data quality for research purposes, not for
administrative data purposes, which are two different things.
So – and be careful, of course, about reaching conclusions based on
asymmetrical verification. Jill Colon(?) from AHRQ had this at the session we
were in at the Joint Statistical Meetings about a month ago.
The example here is that we compare Medicaid enrollees who are linked in
the CPS to the MSIS. We know that they have Medicaid, but, in the CPS, they
don’t report that they have Medicaid or they report that they are uninsured.
We had 15 percent of the CPS data reporting that they were uninsured –
which is a problem. They actually had health-insurance coverage, according to
the MSIS in the past year.
Well, if you multiply that times the 40 million, simply, who are on MSIS,
you come up with six million, and you say, Well, obviously, there’s not 46
million uninsured in the United States. There’s 40 million. That is a dangerous
conclusion for a couple of reasons.
What is going on here is it doesn’t allow us to verify – the linked-data
files don’t allow us to verify if uninsured people report that they have
coverage. So we have only verified – we are only able to verify one piece of
the puzzle. We are not able to verify the others.
So we know if someone on Medicaid said they were uninsured. We don’t know
if someone who is insured says that they were – or someone who is uninsured
said they were insured, and I think it is likely to happen, given that there’s
10 questions that ask what type of health-insurance coverage you have, you
eventually give in and say, Oh, sure, and especially by the time they give you
the last one, which says, Are you sure that you are uninsured? So there’s
likely some of that going on on the other side.
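
The arithmetic behind the asymmetric-verification warning can be written out explicitly. The three constants below are the round figures quoted in the talk; the function `adjusted_uninsured` and the values tried for the unverified error are assumptions added for illustration, not estimates from any study.

```python
# Round figures quoted in the talk.
REPORTED_UNINSURED = 46_000_000   # survey count of the uninsured
MEDICAID_ENROLLEES = 40_000_000   # people on the MSIS
PCT_ENROLLEES_REPORTING_UNINSURED = 15


def adjusted_uninsured(reported, false_uninsured, false_insured):
    """Remove enrollees who reported being uninsured, and add back
    uninsured people who reported having coverage."""
    return reported - false_uninsured + false_insured


# Enrollees who told the survey they were uninsured (verifiable by linkage).
false_uninsured = MEDICAID_ENROLLEES * PCT_ENROLLEES_REPORTING_UNINSURED // 100

# The one-sided correction implicitly sets the unverified error to zero.
naive = adjusted_uninsured(REPORTED_UNINSURED, false_uninsured, 0)

# Hypothetical values for the error that the linked file cannot verify.
scenarios = {
    guess: adjusted_uninsured(REPORTED_UNINSURED, false_uninsured, guess)
    for guess in (0, 2_000_000, 6_000_000)
}
print(naive, scenarios)
```

The linked file pins down only the `false_uninsured` term. Because the offsetting term is unobserved, the adjusted total can land anywhere between the naive figure and the original one, or beyond, depending on a quantity the linkage does not measure.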
So it is important not to jump to simple conclusions when you are doing
these analyses and when you are working with these linked-data files, and
taking a look at the sample loss is really going to be key.
So I am going to finish back where I started, which is I think the
strength of the survey data is that it is in the public domain. There’s lots of
researchers taking a look at it for research purposes. We really know its
This is not true for the administrative data. So it is really imperative
that people who are entrusted with these linked-data files put together both
the documentation and the research on them, so that researchers can reasonably
understand the data that other researchers are using to inform the debate, as
well as the research that are used – you know, the data that they are using.
There are certainly all kinds of standards out there. The Data Documentation
Initiative is one that I am familiar with, the DDI standards for survey data.
Perhaps something similar for administrative data would be a good thing.
Research into sample losses is really key on these linked-data files, and
understanding the measurement error, so –
And that is it.
DR. STEINWACHS: Thank you very, very much.
DR. STEINWACHS: We have time, now, both for questions and comments to the
speakers as well as to go into a broader discussion.
I wanted to invite people who are sitting back in the audience, we have
seats up here, and I would very much welcome you to join, and that way, also,
in making comments, you have a speaker in front of you, instead of having to
look for a wandering microphone. So please come up and join us.
Comments and questions to the speakers?
Is anyone ready to take the quiz to see if they know all of Fritz’s friends?
MR. DENBALY: I have a question for Mike.
I am just checking to see if I understood you correctly. If administrative
data are used to guide the agency to run their program, why should the
evaluation of the data be for research purposes or did I misunderstand you?
DR. DAVERN: I think that the administrative data have been evaluated quite
well for their programmatic purposes, and so the programmatic data, I think, is
very good and solid.
There’s all kinds of other information that gets carried on these files,
like – that gets collected at enrollment time or other things that I think
should be evaluated, and those are the things researchers are really interested
in a lot of times.
The health-disparities researchers want to know about the race and
ethnicity data, but the people who administer Medicaid don’t necessarily care
all that much about that data.
As a matter of fact, when you enroll, some states, on their forms –
enrollment form say, This is a completely optional question. Like I think New
York has one, and so they have incredibly missing data on their Medicaid files
on race, ethnicity.
Other states just have a blank that says, Race, fill it in, and how that
gets recoded into the MSIS, I have no idea.
So those kinds of things are, as far as the quality of – So those are the
things I am talking about for research purposes.
The administrative data are very good at administering programs. There is
no doubt about it. They have been properly evaluated, but I do think it is
something different to think about from a research perspective – from a health
research perspective. They need to be evaluated for that purpose as well.
MS. GENSER: Hi. Jenny Genser, again.
I wanted to give a case study on some admin data that I work very much
with, and that is the Food Stamp Quality Control Data, which I have been
working with for about 15 years, and what I have found with the Food Stamp
Quality Control Data, what it is is administrative data. It is a sample of Food
Stamp recipients that we use to measure payment error, and we found
consistently that the data that are required to determine whether a person is
receiving the correct benefit is generally quite accurate, but other variables
that we have collected – such as race, ethnicity, citizenship status – may not
necessarily be as adequate, because it doesn’t affect the QC error
And our office has worked a lot with the program to make sure that these
data are higher quality, because we use it so much for our analysis, for cost
estimation, for researchers, and, in fact, it is data that is in the public
domain with documentation.
An example with citizenship status: a few years ago we were finding
a lot of people whose race was coded as Native American, who lived
in Oklahoma, and whose citizenship was naturalized citizen, which we knew
didn’t make sense.
So that is just a case example you might want to look at if you are
interested in looking at quality of admin data; the Food and Nutrition
Service has done a lot of work with the Food Stamp Quality Control Data.
Now, in terms of these overall Food Stamp data, that is run and
administered by states and counties. So we don’t know that quality.
DR. BREEN: Just to build on that, when I was talking with the Social
Security Administration Quality Control person a number of years ago about the
possibility of using that data and matching it with the SEER registry data, she
was – We were discussing – I thought it would be good to use IRS data to get
the tax information for information on income and then the Social Security
information to get information on earnings and that that would give us a pretty
decent picture of people’s economic well being, which we had nothing of on the
So we were discussing that and she had been there a long time and mentioned
some of the issues, but one of the things she said was that before Watergate,
when their data were routinely analyzed, especially by people at the Bureau of
Economic Analysis, she said, Our data was in much better shape, because the
researchers would analyze the data and they would come back to us and they
would say, Well, you’ve got a problem here because this data or that data is
not very accurate.
So, in fact, the administrative data – and I want to just put this on
record for the committee in terms of our recommendation and cost to the public
of not doing this – the administrative data gets better by researchers outside
the agency using it, and it gets better even for administrative purposes.
And I don’t know if Deb wants to mention this, because I know she has been
using, and Gerry, too, the SEER Medicare data, which has improved over the
years as a result of researchers working on this data. It is better for
research. It is better for administration.
MR. LOCALIO: Michael, I commend you for your comments about metadata. That
is not often discussed.
I just want to let people know that it is not just an issue for the type of
data we have been talking about here, whether it is administrative data from
various agencies. It is also a terrible problem with clinical data that you
get on people’s health.
Different organizations collect the same data in very different ways, and
it means different things, and if you think you can use a nationwide
health-information network and aggregate data that way and everything’ll be
defined the same, you have much to learn.
So I think – I wanted to comment that the metadata issue, the lack of
documentation, the lack of understanding of how things are collected,
generated, is a real problem, no matter what the source of data is, survey,
administrative data from agencies or whether it is clinical data.
MS. TUREK: I’ve got a question, I guess, more or less.
There really are kind of two types of state data. One is the data that the
state collects – chooses to collect itself to serve some function it is doing,
and the second is the kind of data we feds tell the states that they have to
collect to get benefits from a program like TANF.
Many years ago, I used to work with school-district data, and Department of
Education used to ask for variables. They didn’t get a lot of what they wanted
for political reasons, and, I mean, the guy who was the head of the Council of
Chief State School Officers told me that any time the feds tried to collect
data to evaluate them, they were going to be up on the Hill getting that data
out of the system.
So I would suspect that the data that a state collects because it wants to
and it is serving some kind of state function, like the Vital Statistics,
probably has a better chance of being better than Medicaid data, and it might
even have a more rational set of variables, and I would like to hear somebody
who knows more about this than me talk about the differences.
DR. STEINWACHS: Anyone care to respond to that?
Makes sense to me, but be nice to have –
MS. TUREK: Have you got any ideas, Jennifer?
MS. MADANS: Well, I guess I would agree with the general premise, but maybe
Vital Statistics isn’t the greatest example, because states vary very, very
much in how they use the information, other than the registration information.
I think so much of what Mike was saying that the – if you look generically
at the vital registration system, the registration information is very good,
because that is really why it is collected, and there is a lot of emphasis on
that, and that is what states use.
Some states use the other information that tends to be more of an interest
to researchers, partners and to some states, because they have more programs
that deal with that, and so there is variability in the quality of that.
I think the bottom line is don’t assume any data are good until you have
done an evaluation and then publish that evaluation.
So I completely agree on the metadata and spending the time to do the
MR. IAMS: I am Howard Iams from Social Security.
I have worked with AFDC Quality Control Data and Social Security data, and
my conclusion is if it is used to administer the program to calculate a
benefit, it is usually pretty good, and if it is not, you are using it at risk.
And the QC system that I worked in for six years – three years – states
would code whatever they wanted in the fields they didn’t care about.
Regardless of what the coding form said you were supposed to put there, they
would collect some data item they wanted and just stick it in the field.
And at Social Security, a lot of care is given in calculating eligibility
and a benefit, but some of the other information is not going to be as
In terms of mortality, in Social Security, we have several sources that are
reporting someone is dead, and when we pay a small amount for a death
certificate, and the funeral director is sending in the report to get the money
that we pay, because it is signed over to the funeral director, that is pretty
When the states are sending in reports, which we pay for as well, they are
not quite as accurate, and, actually, it costs us money when we stop paying
that small amount of money that goes to the funeral director, because someone
doesn’t report a death and we keep paying benefits, and Congress, to save
money, will come along and eliminate that benefit. Why pay $250 to report a
death? And it will – They did it back in the ’80s, and it cost a lot of
money, and they went back and started paying it again.
MR. RILEY: I guess one – situation with the Medicare data, on the claims
data, the conventional wisdom is that if it doesn’t – if their data are not
reported on the claims to set a proper payment rate, then you probably
shouldn’t trust it, but, on the other hand, data that are reported for payment
purposes can be gamed as well.
The provider has an incentive to try and upcode and do things like that
that might increase their payment amount. So you have different incentives at
work as to how seriously people take the information and if they are trying to
game it to some advantage, so that you definitely need to consider the
incentives of the person who is providing the administrative data when you go
to use it.
MS. TUREK: I am going to ask a very naive question, then. If, in fact, we
can only trust the payment data, what do we gain from putting the
administrative data in if we don’t know if it is very accurate? I can see the
address files, but on the MSIS data, the enrollment, yes, but is that all we
should use off of these administrative data files is the enrollment data?
MR. RILEY: Well, there has been work to try and validate some. I mean, you
have different levels of accuracy for various elements that appear on the
claims and enrollment files.
For example, there has been work to look at diagnostic information that is
reported on physician claims, and that is not used directly for payment
purposes, but I think the consensus of the validation studies on that variable
is that it tends to be accurate enough to be useful, but not anything you’d
want to make a life-or-death decision on.
So, I mean, there’s levels of –
MS. MADANS: The hospital diagnosis –
MR. RILEY: That is used for payment purposes. Again, we have noticed, over
time, there is a certain creep.
DR. SCANLON: I was just going to agree with his point that, I mean, it
really comes down to the quality of the data at the individual variable or
field level. I mean, because – and this is my conventional wisdom, too, for
years, that if it wasn’t for payment, it was potentially sort of unreliable,
and, then, as you pointed out, we have created incentives for how you report
things for payment, to the point where I would say it is more than conventional
wisdom. We have an empirical test.
I mean, the introduction of the prospective payment system for hospitals in
1983, when it suddenly became sort of such significance for payment, we saw
this dramatic shift in diagnosis reporting, and the story was, well, they had
become better at it, in terms of capturing all the diagnoses that sort of were
associated with an individual, or, potentially, as Gerry says, they recognized
the value in sort of reporting sort of things that maybe were marginal to begin
In terms of if this – we have this problem, I think, this is kind of, I
guess, my sort of where I come out is what is the purpose – I mean, what is the
use we are going to apply this to, and sort of how much data error can we
tolerate? Because, in some respects, we are worse off operating with no
information. We have to think about sort of there’s errors sort of within this
information, but we still may sort of improve our decision making if we use
that data and we take that error into account.
Policymaking sort of always has to recognize that it is operating on a set
of information that is flawed, but it needs to still move forward.
DR. STEUERLE: I have a comment, then a question.
The comment is, Fritz, they say that with age one’s memories turn to
myths. So I appreciate the fact that your memory of my involvement had turned
to a myth as to what I had actually achieved, but thank you anyway. It was a
My question had to do with incentives, which is where we were going just a
As an economist, any time I see something where we feel like we are doing
something incompletely or not as well as we can, I am always brought back to
the question of what are the incentives of the system, and I wonder if the two
of you – either of you – have some recommendations on how we actually change
the incentives of our various systems to reach some of the goals that you
mentioned, and I’ll mention one, but you could go into other areas.
My sense is that when it comes to data, we in social science have it
totally upside down. If Madame Curie tried to publish an article or had tried to
publish an article in any of our social-science journals, she would be
immediately rejected because all she did was gather some data and run some
And my sense is that the people within our statistical agencies who would
gather data, put it together, document it, do all the things that we have
talked about receive almost no rewards in our academic settings, even though
one could argue that they are the ones doing the basic research, and those of
us out there who are using the data sets are really the ones doing the applied
research, which actually gets the – especially in academia or in our journals –
tends to get the credit.
So I guess that is one area to go, but I guess within the agencies, too, I
am just curious whether either of you have given any thought to are there
recommendations that we could think of as a committee that say to HHS or other
agencies, Here’s at least a couple of areas where you really need to think
about incentives of the system?
I think one you mentioned was bringing in outsiders occasionally to
constantly review the data sets as they are developed or however we do it for
whatever reasons, because they are doing research and that is their incentives
or because –
But I am just curious whether – just thinking about incentives, do you have
DR. SCHEUREN: Well, you heard yesterday from Ed Sondeck(?) about two things
he thought ought to be done, and I’ll repeat them again, if I got them right.
Despite the fact that I have aged considerably between yesterday and today,
The first of them was this notion of having a small fund to get analysis
And the second one – which could be done by an outsider – and the second
DR. STEUERLE: Could you just expand on a small fund for whom to do what?
DR. SCHEUREN: Well, let me repeat – Let me –
DR. STEUERLE: (Off mike).
DR. SCHEUREN: He was talking about for NCHS, right? That he would be given
a small amount of money that he could spend at his discretion – okay? – to
bring in outsiders to work on the data, and he would also be given resources to
have internal analysts work on the data, which is what I think you were
alluding to just a moment ago.
That would be a fundamental, and it would be earmarked, and it might not
come from the agency. The problem with it coming from the agency is it gets in
the base, and then they squeeze you again. So it has to come from an NSF or
some place outside that, outside the base. Otherwise, it, in the end, doesn’t
matter more than a year or two.
That is a very good idea. I thought it was an excellent point. He made it
yesterday afternoon, and I would like to see that followed up. Talk to Ed and
develop it more. You asked me to develop it, but you were sitting next to him
when he said it, so – I know you don’t remember it, but – I am teasing you now,
You don’t remember a lot of the good things you did, Gene. That is the
problem, because you are such an humble guy, and I am a student, so I try to
write down what you say, so that I can repeat it or recall it at various
points, and I hope I did a pretty good job of recalling it.
I think that – I want to make a point.
DR. STEUERLE: You said you had two things –
DR. SCHEUREN: And staff, and increase the staff to do analysis. Earmarked
staff to do analysis –
DR. STEUERLE: Earmarked – (off mike).
DR. SCHEUREN: That’s right. They have to be earmarked, because you get a
crisis – And if you are a producing organization, there are always crises. I
mean, it is just unbelievable. There is always a crisis, because the systems
are so structured that they are always at the point of a landslide, if you know
your complexity theory. They are always just at the edge of one more thing
wrong – okay? – and the whole thing goes. Okay? And it is really true. Right,
Ron? I mean, it happened to me a lot.
I have a story about one of the censuses in which the IRS was supposed to
produce a piece of information – it was an industry(?) code – and it wasn’t
being used for administrative purposes, and so one really smart service center
director who was trying to save money, says, We won’t key that.
Well, we didn’t catch it for a while. That was a big mistake. Big, big
mistake. Shame on us. Shame on me, in fact, because I was supposed to have seen
it done, provided some oversight, and it was too late when we found out about
it. Too late. Too bad.
We didn’t have the right systems to monitor it, and everyone was trying to
save every penny they could and beat the other guy in the next service center
over, and bad mistake. Bad mistake made. It really wrecked the Census Bureau. I
apologize again publicly. I apologized to many people at the bureau. I’ll
apologize again publicly to the bureau people here.
I want to make another point that was underlying something that my good
colleague said about metadata and about a subset of metadata called
paradata(?), which is about the process of the metadata.
I actually have written about this and given talks on this, and I think it is
fundamental to making the linkages learning systems. If you don’t do this –
okay? – then, you are gone.
Now, there is really good software out there to do this. I don’t know of
any statistical agency in the U.S. Federal Government that is using it. The
Canadians are doing it, okay? Some private-sector firms – I think RTI is doing
it. We are doing it at Newark(?), although not in a very large way. It is being
done at Brigham Young University. They run exit polls since ’82 or ’84,
and they do a wonderful job of documenting things historically.
I mean, the real problem with documentation is when the person who did it
isn’t there anymore, if you can’t understand it anymore, if you don’t have the
Rosetta Stone somehow, it is gone. Very important, and when Howard talks about
these long-term systems where you add years and years of data – okay? – and the
initial starting point – however good it may be, wherever it came from – the
Census Bureau or somewhere else – is not well documented, you really have a
serious problem, and, moreover, if you are documenting these administrative
systems well – okay – if you are documenting them well, which means you have to
have money to do that and time, then, you will discover errors quicker.
If we had in place the system that I would be recommending today, we
probably wouldn’t have made that mistake for you those years ago. We’d have
made some other mistakes for you, but not that one.
MS. GENZER: I wanted to go back to the Food Stamp quality-control example,
because Howard Iams was talking about problems which I have seen, too. If it is
not connected to the eligibility benefit, it is not as accurate as one would
like, but agencies don’t have to stand helplessly and wring their hands because
you can improve the accuracy, even of those data.
We have a contractor, Mathematica, who, each year, takes the
quality-control data and edits it, and does data assessments.
What our agency does, then, is meet with the program staff, interagency,
and says, Look, Houston, we have a problem. We’ve got naturalized citizens who
are Native Americans in Oklahoma. Does that make sense to you? It doesn’t to
So, then, what we did – and this was in ’01, ’02, ’03 time
period – we had meetings with the regions and then with the states, talking
about these are all the areas that we have problems with. We were not getting
good information about vehicles.
Well, one of the reasons, we found out, was that states that exempt
all vehicles or exempt one per household from the vehicle test, from the Food
Stamp Resource Test, they weren’t collecting vehicle data, if they didn’t need
We realized in our office that, since we couldn’t rely on that data, we
said, Okay. We’ll drop this vehicle in exchange for having an assurance that
the states will make some improvement on the quality of the immigration data,
which we have to have to do our analyses and our reform simulations.
So you can work with the states who are putting together the data to get
higher quality admin database. So there are steps that you can do to improve
the quality of your admin data.
So our agency tries very hard to make sure it is available, and part of the
reason is that it is in the public domain, and if you could go home tomorrow
and download the Food Stamp Quality Control Data from the Internet, since 1996,
DR. SCHEUREN: Compliments to you.
MS. GENZER: Yes. And the documentation, too.
DR. SCHEUREN: Yes.
DR. DAVERN: I just wanted to – The incentive question was really
interesting, and, certainly, what has helped us get this project underway was a
combination of things.
First, it was the Assistant Secretary for Planning and Evaluation, Mike
O’Grady, was very interested in our project with Census, so that got the HHS
side of people interested, and then Robert Wood Johnson kicked in some money,
but – and so it was money and it was power, basically, both, that brought that
together, and brought together a team of a lot of people working on this
particular project. So that was the incentives there.
But I do think that the best way to improve the quality of the data is to
use it for research, and as soon as it starts getting used for research, these
things are going to get done. So the more – Certainly, through access – Because
the data-collection agencies, the administration data-collection – the
administrative agencies are going to – They don’t like to make mistakes, and if
you can show them or document the mistake, they are going to fix it. That is
DR. SCHEUREN: They don’t like to be embarrassed publicly.
DR. DAVERN: Well, right. Well, they don’t like to be embarrassed publicly,
and if you work with them and you are reasonable about it and understand that
mistakes happen and you go in and you improve the system. That is what we are
trying to do. You constantly want to work on improving the system.
So, in some way, whether it be through – Certainly, NIH could fund
researchers to do this stuff, perhaps giving subcontracts to Census or NCHS.
You have to actually work with the data, because we can’t see it or if it can
be done at a research data center where we can actually get in and have access
to it, and a Title 13 benefit, for example, would be producing metadata.
So if you go in and work and do the research with this stuff, you’ll
provide it, too. You’ll have to provide a piece of that metadata.
DR. SCHEUREN: Good. I like that. That is very good.
DR. STEINWACHS: Very good.
MR. PREVOST: Well, I was musing. There is probably smoke coming out of my
ears, this late on the second day of a conference.
But one of the things I was thinking about – and, often, as researchers, we
don’t think in this way – is that as you are looking at data from either your
own organization or another organization, you compare them and you say, Hmm,
something is different here. It is not working, and somebody is – quote/unquote
– wrong, whatever that means.
When you do find that things aren’t matching up and you suggest a change to
one of those organizations, we need to think about not just a statistical
environment, but we need to think about how can we measure that agency’s or
that entity’s return on investment, okay?
If we can start showing that, yes, when Researcher X accessed File Y and
completed this research that they were able to reduce the operating costs or to
improve the measures quantitatively that an agency is producing, I think it
would be a huge benefit to be able to continue this type of research and to be
continuing this type of linkage activity.
DR. SCHEUREN: It is important, though, to have this candid tradeoff which
the Food Stamp example was beautiful. I am going to give up something. I have
two bad things. I am going to keep one and make it better and give up the other
one. That is the right –
I mean, sometimes, you can improve the operating efficiency of an
organization by simply focusing on a weakness, and that certainly happens, but,
often, it is weak because you are just not spending resources on it.
MR. PREVOST: Well, yes, and just to add to that, I mean, how often have you
worked with an agency and they say, I don’t know why we are collecting this
piece of information. It has been on the form for the last 25 years, and you
finally say, I really don’t need this. You have automatically improved their
efficiency, but you need to capture that, so that when you go back again or to
another agency, you can start using these benchmarks to show them that, yes, it
is effective to do this work.
MS. MADANS: Kind of expanding on what Ed was talking about yesterday and
what has been brought up, I think it is true that there is not a lot of glory
in doing some of this basic work, telling people what is wrong, and it doesn’t
make you popular, and it certainly doesn’t get you articles in JAMA.
But I think a lot of the agencies are trying to incorporate more
methodologic program, and what they need to hear is that that is useful, and
useful to users, useful to the people who are advising the Secretary.
I mean, there has to be some reinforcement, because, in order to do it, if
you are not going to get new money – which, of course, we would all rather have
new money – then, you are going to be not doing something else, and so you also
then don’t want to get bitten by saying, Well, why are you not doing X? Well,
because we are doing quality stuff, and I think we are moving a long way to
providing metadata, especially the Web. It is much easier to do it.
We used to try to do these reports. They were hard to do. Well, now, we
have put all the metadata, all the quality stuff up on the Website, it is so
But there has to be some acknowledgment from someone that this is an
important thing to do.
We would like – I think – what Ed was really talking about was a grant
program, which is a nice thing to do, because you can bring the outside – the
We like to have IPAs come, because, then, they can have access to things
that they can’t have other places. We have a senior fellowship through ASA.
These are the kind of things that we would like to have external input on, but
it is a cost, and if there is no changing that incentive, if somebody is not
saying, This is a good thing to do, people are going to say, Well, it is not
considered important. It is more important for me to go out and –
It is more expensive to figure out that I don’t need that item that I have
been collecting for 25 years and no one is looking at it. There is a cost to
doing that. I might as well just keep collecting it and everyone will be happy.
DR. SCHEUREN: There might be a stakeholder, like the Census Bureau, that
needed it, too.
MS. MADANS: Well, I’ll tell you, every time we take something off that no
one has been using for 25 years, somehow –
DR. SCHEUREN: Somebody needs it.
MS. MADANS: – the person who wanted it, has my home phone number.
DR. STEINWACHS: They may be resurrected from the graveyard, right, Fritz?
Some of those people coming back.
DR. SCHEUREN: That is another project I worked on, yes. We won’t do that
DR. STEINWACHS: We have talked a little bit about what might be possible
areas in which the National Committee on Vital and Health Statistics could make
recommendations, and, as you know, those recommendations are made to the
Secretary of DHHS, but those recommendation letters also go more broadly, in
terms of being shared, and so some of those recommendation letters in the past
have made clear the value of the idea that DHHS might take the
leadership of bringing together multiagency groups to try and address specific
issues, and so in areas of consolidated health informatics and some other
things, there have really been multiagency activities that have gone outside of
So I just wanted to have a chance here, if there are other areas that you
would like to identify where you think the capacity of this committee to make
recommendations would be valuable.
DR. SCHRAG: I just want to bring up the issue of death certificates. So
mortality is a great outcome, and we all love mortality, and it is measured
DR. STEINWACHS: I don’t like that outcome whatsoever.
DR. SCHRAG: Okay. Well, I mean, dead is usually dead, but we all pick up
the newspaper, and, not to mention journals, and look at issues of
cause-specific mortality, which is sort of one of the key vital statistics, and
death certificates in most states basically have proximate cause, contributing
cause and underlying cause, and one of the most misused statistics is
cause-specific attribution. It is an area where there is total anarchy, very
little guidance, very little in terms of metadata files, very little in terms
of training the individual – I mean, this is something that resides at the
level of the individual providers.
I will speak as a physician. I code death-certificate data. I started doing
it the first week I graduated from medical school. Those are the people who
code death certificates for deaths in hospitals.
You are called to see a dead person. You don’t know anything about them. It
is two o’clock in the morning. You know what the easiest thing to do is? It is
to write down, Proximate Cause: Heart Attack. Contributing Cause: Edema,
pulmonary embolism. Underlying Cause: Cancer, or – based on what floor the
patient – It is done terribly. It is done terribly all over the United States.
It wouldn’t take that much – There are not even documents. There is no
training for physicians.
MS. MADANS: There is. There is. There’s a lot of –
DR. SCHRAG: There is documentation, but there is no training at the level
of the people who are filling them out.
MS. MADANS: We should talk. There is training. We develop a huge amount of
I am agreeing with you. There is a lot to be done and there is a huge
literature on death certificates and the problems with death certificates and
cause of death, especially in the –
DR. SCHEUREN: Your mike is not on.
MS. MADANS: Yes, it is on.
DR. SCHEUREN: It is? Okay.
MS. MADANS: I mean, I think it is a good example where there is
information. We try to – This is a state issue, because states are in charge of
death and birth and marriages and divorces, but it is how do you get – when the
source of the data – you have no control over – How do you improve quality? And
that is happening in all of these administrative systems, because you have no
control of the person who is actually filling it out, and, I think, in birth
certificates and death certificates, there actually is a lot of work with
coroners. We go to conferences. We do the training, but if it is not in medical
schools, the medical schools don’t want to hear about it. They are too busy to
worry about death.
It is very hard to get that dialogue going, and unless – There has to be
kind of this joint agreement that this is worth time and effort, and how you do
that, I don’t know.
DR. SCHRAG: And I guess – I think – Yes, it may be at coroners, but it is
not at medical schools or medical providers, and I think the way to get them to
pay attention is to incentivize them by providing them with data back.
I don’t think it has to be monetary incentive, but they actually care about
the data, and if they see what the data can do for them, then, all of a sudden,
I think that you would see even more in terms of input into training, because
it is an absence. Maybe some states are really good at it.
DR. STEUERLE: This last discussion, let me say, I do think that is an area,
not just the subcommittee, but the full committee has recently decided we may
need to engage. We did have some discussion on vital statistics and even the
issue of whether – this is also an issue for us – is whether we need something
like, I am thinking like the multistate tax commission, where it is not tax,
but whether the state-level attempts to develop protocols and standards and
stuff like that are adequate.
I don’t know enough about it to say what could be done, but I think it is
something we should probably proceed to do. Although, I am somewhat a follower
of Woody Allen, who said, I don’t mind dying. I just don’t want to be there when
The question I have is, with respect to ways of – I mean, getting back a
little bit to incentives. I am thinking of a couple of gaps, and I don’t know
fully how to approach it, but let me base one on experience like Fritz and I
have had at IRS, and that is with respect to data that is in one agency that is
not related to their mission that no other agency really has a strong incentive
to get at.
So I am thinking this may sound minor to people here, but IRS has data on
health savings accounts.
Now, I don’t know how much effort they are or are not putting into it.
There are people, for health reasons, who have a very different interest in
developing data on health-savings accounts and whether people who use
health-savings accounts get the same healthcare as others, which would be very
different than IRS’s concern, which might be whether people are paying their
taxes adequately or IRS’s data on the use of deductions. I don’t know whether
deductions, which are probably concentrated on things like nursing-home care,
would be related to somebody doing research on nursing-home care.
IRS certainly has, indirectly, data on – to the extent it is listed on
employers – payment for health insurance.
So there are these data sources with someplace like IRS, but my sense is
there is very little incentive for HHS to say, Gee, we’ll provide 10 percent of
our health research funding to IRS to go do this data, and yet it could be
within IRS there is some really, really valuable information, and I used IRS,
because I used it as an example, but I am sure this is true across a lot of
agencies. It could be that adding a health variable to one of these educational
longitudinal studies could really help us to understand the extent to which
education outcomes are good or bad.
To the extent you can talk the agency into doing it, it is one thing. To
the extent you actually provide funding to other – interagency funding is
another, and I know there’s been minor examples of BEA, through – a lot of
pressures funding, say, IRS to get data, but that is because IRS has the core
data to develop national income statistics. So you are at that level, sometimes
you can get interagency funding.
So I was just curious – So that is one area where I wonder whether we need
to examine incentives.
And the other one has to do with whether we do come back and do
You know, you mentioned several times the improved quality of Food Stamp
data, but I know for certain debates I am in, the quality of the data is very
misleading. So I am in this debate all the time of whether the earned-income credit has
higher error rates than Food Stamps, and there’s always these – Well, Food
Stamp error rates are down 5 or 10 percent.
Well, the reason Food Stamp error rates are 5 or 10 percent, they have a
monthly income-accounting system. They essentially have abandoned that and
said, Well, we are going to assume that, for compliance purposes, if you
reported your income right one month, it is right for the next six months.
It is also accurate mainly if you are talking about people who don’t have
earnings, so people who aren’t in the workforce at all, sometimes your earnings
– your accuracy of your earnings record is correct, but it is very inaccurate
if you get to the question about the three-million missing people in the Census
who are often in these Food Stamp households, depending how you measure
So there is this argument that – this is a big policy debate – the
earned-income credit is a much worse measure than Food Stamps, but I think
actually – at least I look at the data, I don’t think it is, but for the health
data, there is probably the same thing. How do you get outside researchers to
come in and be critical – I mean, the agency has very little incentive to have
people come in and be directly critical about the way they are doing things,
and I am just curious the extent to which we need to raise that issue as well.
So I give two examples, to summarize. One is how do we get outsiders to
come in and do critiques? How do we pay outsiders to come in and do the
necessary critiques, when the outsiders don’t have a strong incentive system?
And, secondly, how do we get agencies to do more cross funding, where it
may be very useful – another agency has the data they really like?
DR. SCANLON: I wanted to just sort of comment, because of an experience
with the – potential experience with the IRS that never sort of happened, and
kind of with the principle that – recognizing that the unique role of the IRS
as the tax collector should be taken into account, and that one needs to sort
of approach that question, because that is a controversial role, in terms of
In HIPAA, one of the features of HIPAA was the medical savings account,
which was the predecessor to the health savings account, and it was actually
the thing that held up HIPAA for a long time, and there was finally a
compromise allowing for a demonstration program to go forward with 750,000 sort
of people that were going to enroll and an evaluation to be done or be
contracted for sort of by GAO.
And when we thought about that evaluation, we realized that the only way we
were going to identify sort of the people easily that identified people with a
medical savings account was to go to the IRS, and the issue was sort of what
were the sort of patterns of use among sort of people with medical savings
accounts versus other parts of the population, because the people that oppose
medical savings accounts say that people are going to forego sort of important
needed services because of the high deductibles and we are going to get sort of
poor health sort of as a result.
We rejected that sort of as an approach because we couldn’t imagine sort of
having it leaked that the idea of, Hello, I am from here – from the federal
government to ask you about your health. I got your name from the IRS, and that
is the kind of thing, I think, that we – it was not a good application.
Now, I think there are other applications that are potentially good. I
mean, it is a question of whether, within the context of information that the
healthcare, but, at the same time, there’s limitations on what we can do, and I
think what we need to do as we move forward is to think about sort of how do we
give advice about how you maximize sort of the benefit without incurring the
We were worried about a backlash in that regard. This was 1996, okay? The
climate was even sort of more sort of, I guess, fractionated, sort of then than
it is today. There was a very sort of strong sort of political sort of
atmosphere that – and we were worried sort of about what would be the response
And so I think that we need to sort of find – I mean, as we give advice to
the Secretary, find sort of the directions where there’s some safety in the
path, I mean, because we can sort of make some progress forward, but if there
is going to be a response that is too negative, we are going to be set back for
too long and it takes too long to recover from those setbacks.
MR. PREVOST: Just an idea, and perhaps this is coming from someone who has
been head down in the trenches for too long.
As I look around, one of the things – I love coming to these conferences or
conferences like this because I find out about all sorts of data that I never
even knew existed before, and I find that as a concern as a person who had
operated as a statistician – now, I am just a manager – but – and had to solve
real-world problems and really didn’t know that these resources existed.
And, furthermore, are the data available? What can be done with that data,
and is there some possibility that there could be an interagency working group
that could be looking at both data availability and linkages and measuring the
quality of those linkages at the working-person level, not a bunch of us
managers around, but those people who are statisticians who have to solve
real-world problems that each agency, frankly, has?
And the incentive here is that they would be working together to – as a
problem comes up with Agency A, maybe you’ve got an interagency group that
could be focusing on that, looking at specific files, so that each agency that
is participating in it could see that as a shared resource upon which that they
Now, this may be a utopian view of the world, but it is just a thought,
something for you guys to ponder.
MS. OBENSKI: I guess Ron and I have been working together too long, because
I was kind of converging in the same direction, and that is, I guess, that I
have never actually been to a conference like this, and I think it has been one
of the most remarkable things that I have experienced, in terms of making
progress in record linkage.
And I guess where I was thinking was trying to use kind of like the
Medicaid undercount models, like what are the big problems you are trying to
solve that could warrant record linkage, and then building on what Ron said,
bring the right groups together, and something like what we did in the SHADAC
or the Medicaid undercount model, because that is where the incentives
The incentives are what is a big problem and what are the pieces of it and
what are the different views of the different agencies and what do they have to
gain from it. Because we’ve got states involved that are willing to be involved
in this project. We’ve got ASPE. We’ve got CMS, and, as I said yesterday,
what is remarkable is everybody is coming at it from a different – for a
different reason, a different agenda.
Ours is improved statistics. Mike’s is improving the CPS, so he has better
data to do his research, and so I think that that would be a tremendous outcome
of this group.
DR. STEINWACHS: I have recorded it. It has been fully recorded.
Other ideas, suggestions?
Sally, you were mentioning states, too, and I think one of the themes that
has come up over and over again is are there better ways to work with states
than we are doing now or maybe more integrated ways that bring multiple
agencies together to work with the states?
MS. OBENSKI: I think that our experience, albeit limited, is that I think
that whoever brought up the question about incentives, I do think that there
appears – and this is just an observation – my observation, not the Census
Bureau’s – is that there seems to be, from what we have experienced in working
both with state folks and with federal folks on different projects, is that
there are competing agendas in terms of what the states’ incentives are in
administering their program and what the federal incentives are, and that is
very, very important to understand how to bridge that, because big federal
programs that allocate to the states need the state’s help in administering the
program, but that is just an observation that I think needs to be addressed
before we are really going to make these two pieces fit.
DR. SCANLON: I have been here thinking that we potentially need some kind
of process for reconciliation.
I mean, I am guilty sort of, I think, over the last two days of being kind
of the nay sayer in terms of quality of information and sort of saying we have
to worry about the quality of information, and I’ll tell Deb that, almost on a
weekly basis, I do say, Don’t let sort of the perfect be the enemy of the good,
and I still believe that even though what I have said here over the last couple
But it seems to me that sort of with respect to quality of data, with
respect to privacy, there is an issue of the tradeoff between that and the
social benefit we might get sort of from sort of moving forward, but I guess I
am concerned that we don’t have – necessarily have a situation where the
decision maker is weighing those two things.
The decision maker may be approaching this from sort of one perspective,
okay? Their job is that, under statute, they are to protect the privacy of sort
of all of the respondents that are in this data set, and they are moving toward
the point where the probability of disclosure is .000001 – okay? – and is that
sort of where we want to be from a social perspective? And the answer is,
It would not be unreasonable, potentially for a government. It would not be
sort of an arbitrary and capricious thing for government to do to increase the
percentage of disclosure to sort of have three fewer zeros sort of in front of
And the question is how are we going to get there? How do we get there in
terms of weighing sort of these benefits versus the risks or the – of either
sort of privacy or of poor sort of data sort of leading to erroneous
And I don’t know whether the IRB model, which was suggested here, whether
there is some variant of it that we could think about sort of in government or
whether we should think about somebody who does – in some respects, serves as
the arbitrator, hears both sides of this and comes to some conclusion as to
what is the appropriate sort of tradeoff, because all the things that we do
with respect to statutes, all the things we do with respect to guidelines are
still going to have subjectivity in them. Somebody is going to have to make an
interpretation saying, This is reasonably consistent with that statute,
reasonably consistent with that guideline. It is never going to be black and
white. It is going to be something where you can say, Okay. Clear case. The
clear case is not to do anything, which we know has incredible social –
So that in trying to atone for my negativity about sort of quality, I
wanted to sort of offer that as a process way of dealing with moving forward
for the future.
DR. STEINWACHS: You have atoned well.
MR. DENBALY: In terms of process, and in the context of everything that we
have been talking here, suggestions that are being made, working groups are being
put together, and perhaps one project can be picked up and used as examples.
One that I have in mind is one that Ron listed as the first one on his
slide, high-valued future research project, and that is the connection of
NHANES to WIC and Food Stamp data, in the context of health.
I think what we eat is probably one of the most important things. We need
to understand why we eat, where we eat it, how we decide how much to eat and so
on, and, in particular, we are spending over $20 billion on Food Stamp and
other programs as such. We need to understand the consequences of these
programs and even the administration of these programs. So studies, as such are
highly policy relevant.
So we have this group in here and a lot of the issues that we have been
dealing with is what you have been talking about, states and providing the data
and linking up with NHANES is very tough.
So I am suggesting that this group is definitely needed to give leadership
to addressing some of these issues that we are – in a broader view, that we are
dealing with, and perhaps this group and groups as such can work together to
bring it to the table to say, Here is the kind of problems that we are dealing
with. What should we do with it? How do we address it?
MR. IAMS: My hope would be that you would do something that would make it
possible to use better data in policy analysis and evaluation by permitting a
greater exchange of linked survey data across agencies with less pain and
suffering or prohibition.
For example, Gene Steuerle yesterday said, Gee, if you had earnings records
tied to Medicare records, you could look at something connected with Medicare
expenditures and lifetime benefits and lifetime earnings.
That would not be possible to happen in the current legal context. Our
administrative data – Well, maybe it could, but I doubt it, because you
wouldn’t have any authorization for the Internal Revenue Service to permit this
My ultimate goal would be if it is for statistical purposes in a safe,
secure environment that is not going to violate confidentiality that any
exchange should be possible with government data or government-linked survey
data to other agencies.
We have some linked survey data that, if ASPE had it, their decision making
would be a heck of a lot better than it is now. Rather than making it up on the
back of an envelope, you would have something that is closer to relationships
that exist. I won’t claim that it is perfection, but it’ll be closer.
The Title 13 constraint of Census data linkage is very, very limiting, in
terms of this kind of thing. A straight policy analysis on something connected
with Medicare or whatever might have nothing to do with data quality. It is a
question of having information connected with a decision on what the agency
supports or doesn’t support.
That kind of exchange in matching up data that currently don’t ever meet
each other isn’t going to happen without legislative changes that make the
statistical purpose and a legitimate or the policy analysis of statistical data
in a secure environment a legitimate activity that the federal agencies can
exchange and share data amongst each other, and I don’t know the prospects of
Brian Harris-Kojetin agreed, when I mentioned it to him this morning, that
we really need legislation that allows data that is tied up in one agency to be
shared for statistical purposes at another.
Of course, you’d want to make sure that it wasn’t going to be used for
administering benefits or taking sanctions or anything of that nature. So you
need some sort of protections like CIPSEA has offered. I don’t know if CIPSEA
is the proper place. One person pointed out if you opened CIPSEA up, you could
have bad things happen to CIPSEA.
But we – I think the exchange of linked data and linking more things would
lead to wiser policy analyses and wiser decisions, and the agencies need to be
able to share more than they can share.
MS. TUREK: (Off mike) – the statistical enclave ASPE would not qualify,
because we are basically a policy office in the Office of the Secretary, but we
are probably among the most intensive data users in HHS, and we have the
broadest data needs, because we really are analyzing all the federal programs,
and if they do open this up, I hope they can figure out a way for us to have
access to the data, too, clearly, under controls, but none of the sharing
agreements I have ever seen would include us.
DR. STEINWACHS: Joan, we have to send you to a secure data center and not
let you out.
MS. TUREK: (Off mike).
MR. IAMS: Well, but Joan couldn’t come and use data at my secure data
center. I am three blocks away from her. We would have to have a special
agreement and you’d have to figure out some Title 13 purpose with Census to
permit that to happen.
So it is not just sending someone to a secure data center. If they are not
from the appropriate agency with – the appropriate legislative requirements,
they can’t –
MS. TUREK: (Off mike) – our travel budget.
DR. STEINWACHS: I’ll get Ron in just a second.
I thought you were saying, Howard, that if we could change the legislation,
it would be, in a sense, to create something where someone could, like Joan, go
to a secure data center, do something for ASPE, which otherwise couldn’t be
done if you were saying transmit the data to ASPE, and so on, and that was –
Okay – Ron.
MR. PREVOST: Yes, thank you.
I think I mentioned this in the first day. I just want to reiterate it. I
mean, absent legislation being conducted, I think one of the things that would
help all of us is if there were standardized agreements that – particularly – I
am going to suggest OMB. I don’t know. They may – they wouldn’t – I think the
appropriate person had blessed that said these are the way that we are going to
share data between federal agencies, and they had standardized components and
all the lawyers understand what these standardized components are, and so we
don’t spend years and years and years working on agreements between the
If it was just getting rid of all the – I am not a lawyer. So there is
language I look at and I go, The word is the –. What is the question? But they
interpret it differently. Okay?
So in looking at this, if we had these standardized agreements with a
blessing and a set of procedures that says, Yes, if Agency A and Agency B want
to share data and they both believe they have the right to share data, how can
we cut the time that it takes to do this?
If we could do it in two months, rather than two years, it would be a huge
advantage to the entire federal government.
MS. MADANS: We have kind of had two – several parallel conversations going
on the two days, one of which is about access and confidentiality and the other
is kind of the ease of the linkage.
I think we absolutely need to do what you just said, because it is – It
won’t solve the problem that Howard brought up. I mean, if the legislation says
you can’t do it, then you can’t do it. So we need to fix that, and we need to
figure out how to do it quicker, but a lot of these things are done with the
understanding that there will be very good confidentiality protection.
And Fritz brought up the IRB, and I can tell you that our IRB, which is
very well schooled in statistical uses – because that is basically all we do –
much of what they allow us to do, without getting very explicit consent – I
mean, really going through every possible bad thing that could happen to you –
is because we will protect the confidentiality of the data.
And so while many people really are pushing for more access, the more you
expand that access or there is the perception of that access, the less you
probably are going to be able to do in terms of the linkage, and it shouldn’t
be an either/or, but, at some point, it will be, and maybe we have been – We
have always dealt with it is either public use or it is confidential. There is
nothing in between. You know, it is either in the data center or it is on the
web, and we are changing that. I think we are trying to think through –
especially now that we have the sworn agent – what kinds of risk, what kinds of
– this is kind of the – what would you call it? – the portfolio idea.
But it is going to take a lot to work that out, and while we are figuring
out how to do these standardized agreements, it would be nice to be able to
figure out what are the primary criteria, how you determine what kind of data
you can put where for what kind of access, and, for certain things, I am sorry,
you are going to have to come to a data center, and when – I think it was
Heather, when she said she remembers the punch cards and going to the computer
center at three o’clock in the morning, because it was cheaper and all those
good old days, that we are now in – we have been in a position where data
access has been very easy. We went through a period where – just gave you the
CD-ROM. We put it on the Web.
I have a feeling that was a very, very nice golden age in terms of data
access. That is not going to be the main access route in the future, that there
is going to be more of a range, and we can certainly make people’s life easier,
but users, I think, also have to kind of get a reality check that linked IRS
data with our DNA information is not going to be easily accessible to users,
that we are going to make you jump through lots of hoops before we give you
that, and I think that is our responsibility.
DR. BREEN: I think, though, that – you mentioned we had been talking on a
number of parallel tracks, and I think one is providing data access to users
outside the federal government, but the other is that even – we can’t even
provide access within the federal government in a timely manner.
So I think that we need to think about both of those things, and, then, I
think this notion of the portfolio that Julia had yesterday is a really, really
good idea, and a lot of people have mentioned that, that there can be various
hoops that you hop through depending on how much detail you want, and for
exploratory analysis, and some of the original stuff, maybe a public-use data
set or some very basic information is just fine, and it is only subsequent to
that, when you are – you have found you’ve got enough information to test your
hypotheses and you can write your grant with that, that then you can move
forward with the rest, and maybe we need to relax some of our standards about
how much information you need to provide in that grant, too.
And one other thing I wanted to mention was I think it was Gene who said,
Well, can’t people come in to the federal government and kind of take a look
around and maybe examine the culture and make some suggestions, and, certainly,
they can, through IPAs, and I know at NCI and at NIH, generally, people come in
and do program evaluations. They come in a team of people almost like they
would in an – Well, like they would in an academic department for accreditation
purposes to evaluate what is going on and make suggestions on what are the
strengths and weaknesses and where you might want to be in five years or 10
years or something like that.
Plus, on a smaller scale, we have had people come in under IPAs –
anthropologists, management specialists – to evaluate what is going on and to
So all of these things are possible, and I think there are a lot of
creative ways that we can think about using and building on what we already
have as we are trying to get legislation changed, because legislation change is
a long process, and so I hate to put all our eggs in that basket.
MS. GENZER: I wanted to go back to the data-center issue.
I think it is very important if we are moving more to data centers that the
data centers be staffed adequately so that once a person has access to it that
they can then get the data that they need.
For instance, if you need to – if you are granted access to certain
variables, that the agency that is providing the data at the data center has
the staff to be able to get that certain data to you that you have been
authorized to use, so that you are not cooling your heels for an unspecified
period of time, and that is especially important if you are a federal agency
working with a contract and your contractor has a specific schedule and you
don’t want that schedule to be way off whack so that the staff on the
contractor that you have available for your contract is now busy doing
competing contracts, because they were expected to be finished.
So wanted to mention that.
MS. TUREK: I was thinking about what Heather said this morning – A lot of
things we are doing, two months is too long. The issue – we need the answer in
So if we have to go to data centers, there needs to be some kind of
agreement that we can get much faster turnaround than two months, which means
we would have to negotiate a more general agreement that would allow us to do
certain classes of studies rather than a particular project.
I mean, it is in the nature of the policy arena that – two months it is
like the long run. We are all dead. I mean, frequently, the issue – and this is
something the Secretary is interested in, because he is the one who is being
given the information – or the White House or the Hill.
So we go up and say to him, Because we have to go to data centers, we can
no longer do this kind of analysis, it wouldn’t sit too well.
MS. MADANS: So global access.
MS. TUREK: I mean, I think you can – for certain kinds of studies, rather
than each study very specifically deciding that you are allowed to do certain
classes of studies.
MS. MADANS: We had a call from CBO last week, was exactly this, and that is
what we are doing. They are going to have a generic kind of topic-specific, and
when they need to run something, they just have to make sure somebody is there.
Unfortunately, our data center is not busy every day, so there is not someone
always there to welcome you, but as long as we are there, they can come and do
the analysis. So I think that is the kind of thing you have to kind of build
I don’t know – We have different authorizing legislation. I don’t know if
Census can do that, but ours does allow us to be more generic in terms of what
the project is.
MS. TUREK: If I understand with Census, it has to serve a Census purpose.
So you have to find some way that the results can be used that are in line with
your mission, not so – I mean, if Census wanted something and the best place to
get the data was the Census, would it be covered by Title 13 or would they be
told, You wrote this law. Too bad.
MR. PREVOST: Well, I am certainly not the one to speak for every word of
our mission, but, I mean, certainly, what our job is is to disseminate data. I
mean, it does no good to go out and be the collector of data if you cannot
provide it to the people who you are supposed to be delivering it to, and that
is why we have the Research Data Center Network.
And in doing this, and as we have said around the table here, that in using
the data, you can find out where the warts are and you can tell the folks what
they can do to improve the information; that is, if you are working with our
data at a research data center, is a Title 13 purpose, to research and to
suggest improvements to the data that we have at hand.
MR. RILEY: We have talked about some of the administrative data sets and
the fact that, I guess, the lack of documentation on some of them is a barrier
to people using linked data sets, particularly if they are used to, say,
analyzing survey data and they find the prospect of adding administrative data
sort of daunting, and CMS has started the ResDAC Project to help new users of
Medicare data. ResDAC is a contractor to CMS that helps people, not only use
just Medicare data, but they help SEER Medicare data users and so forth, and
they have taken that on as part of their responsibility.
So that might serve as a model for other agencies that have complex
administrative data sets, and it might help increase the demand for data linked
to those data sets and might be something to be considered.
DR. STEINWACHS: Is there something up on the Web that talks about what they
do? I was just looking for a document.
MR. RILEY: ResDAC has its own Website. It is www.resdac. – it is at the
University of Minnesota.
DR. STEINWACHS: So it is probably –
DR. DAVERN: I think it just changed, because all the Websites changed, but
you can just get there – resdac.org, I think, gets you there.
MR. RILEY: I think it ends in edu, but I think there might be something
about Minnesota in between resdac and –
PARTICIPANT: Google it.
PARTICIPANT: Google it. That’s –
MR. RILEY: Yes, Google it. Google resdac, you’ll get it.
DR. STEINWACHS: Yes, we’ve all got consensus, Google is the way to find it.
I thought I would start bringing us to close, because I know we have lost
some people, and I really appreciate all of you who have stayed throughout
this, because this is really what we had pictured as a real chance to have a
dialogue that was among the committee, the agencies, the users and the
producers and providers of very critical information that supports both the
nation’s mission and, from our point of view, is critical information to
understanding health outcomes and ways to intervene to improve that.
Just so you know what we are doing, we will be looking at what we have
heard and what we have gotten from you, in terms of our capacity to use that to
make specific recommendations to the Secretary and, as that moves ahead, we’ll
be happy to come back and share those products once they are actually approved
by the committee that would be going to the Secretary. They get posted at the
Website at the time that they are actually signed and sent to the Secretary.
There has also been a suggestion here, which we are going to take back,
that maybe there ought to be subsequent meetings, so that we may be coming back
to you talking about what kinds of topics.
There has been a lot of discussion about state data and might have some of
the same people around the table here, but people from the states or others.
There might be other issues, so that we are considering are there next steps
that are important for us to take in learning and gathering information, and
those will probably be more focused.
I think this has been a very open and, to me, a very, very productive – I
sort of thought I knew what was going on, and this convinced me, after about
the first five minutes, I didn’t, and I was here to learn a lot and have
really, really enjoyed it.
I also want to thank Cynthia. Cynthia has disappeared, but Cynthia may hear
me someplace that, without her, we wouldn’t physically be here, and there was a
lot of rushing to get this all together.
And very much, Joan, I appreciate that you now – I now know you know
everyone and – Well, everyone that is important anyway, and without you, this
wouldn’t have happened and –
MS. TUREK: And I want to say thank you to everybody for agreeing to be part
of it. I think it was the participants who really made it, and, although you
hear me complain a lot, I think all of you are really great, special people and
you do wonderful work.
DR. STEINWACHS: This has been recorded. So, now, you can play it back
anytime you want. Joan has certified. Joan has said –
I also, again, from the committee’s point of view, want to thank Gene and
Nancy, because it was really their leadership that got us going as a committee
on this, and, then, Jim Scanlon said, Hey, ASPE is very interested in this.
Joan is working on this. Brought us together as a team, and so the rest of us
have benefitted from all of that, and certainly benefitted from everything that
you have come here and shared with us. So –
Also, so you know, the audio tape – I think I mentioned this – will be
posted on the Website, once it is done. There will be a written copy of the
transcript of this, and that our plan is to post the slides on the Website,
too, so they are available, so that when people listen to the tape, they can
see the slides as well, as we were going to copy them all for the committee
members, because many – made many points in there, and we don’t want to forget
those, and we want to capture those.
I don’t know if there are any other comments by committee members, but, if
not, thank you all very, very much.
(Whereupon, the workshop adjourned at 3:25 p.m.)