[This Transcript is Unedited]
Department of Health and Human Services
Subcommittee on Privacy, Confidentiality & Security
National Committee on Vital and Health Statistics
“De-Identification and the Health Insurance
Portability and Accountability Act (HIPAA)”
May 24, 2016
Hubert H. Humphrey Building
200 Independence Ave., SW
TABLE OF CONTENTS
- Introductions and Opening Remarks – Linda Kloss, Chair
- Overview and Framing of Current Issue – Simson Garfinkel
- Panel I: Policy Interpretations of HIPAA’s De-Identification Guidance
Speakers: Ira Rubinstein, Bradley Malin, Daniel Barth-Jones
- Panel II: De-Identification Challenges
Speakers: Michelle De Mooy, Jules Polonetsky, Ashley Predith, Cora Tung Han
- Framing of Issues by Subcommittee Members
- Public Comment
P R O C E E D I N G S (9:05 a.m.)
MS. KLOSS: It is 9:05, and we have a very full day. This is the National Committee on Vital and Health Statistics Hearing on De-identification and HIPAA. This hearing is being conducted by the Privacy, Confidentiality and Security Subcommittee of NCVHS. I welcome everyone in the room and on the phone.
As is required for a federal advisory committee, we will begin by having the members of the committee introduce themselves. I am Linda Kloss, member of the Full Committee, co-chair of the Privacy, Confidentiality and Security Subcommittee, member of the Standards Subcommittee. I have no conflicts of interest regarding the topic at hand today. Let’s begin. Vickie, will you introduce yourself and we will just have the members of the subcommittee introduce themselves.
DR. MAYS: Thank you. Good morning. Vickie Mays, University of California, Los Angeles. I am a member of the Subcommittee, I am a member of the Full Committee, I also chair the Workgroup on Data Access and Use, and a member of the Subcommittee on Populations. I have no conflicts.
MR. COUSSOULE: I am Nick Coussoule. I am with BlueCross BlueShield of Tennessee. I am a member of the Full Committee, also a member of the Privacy, Confidentiality and Security Subcommittee and the Standards Subcommittee. I have no conflicts.
DR. SUAREZ: Good morning everyone. My name is Walter Suarez. I am with Kaiser Permanente. I am a member of the Full Committee and serve as the chair of the National Committee. I am also a member of all the workgroups and subcommittees. I don’t have any conflicts.
MS. MILAM: Good morning. I am Sallie Milam with the West Virginia Health Care Authority. I am a member of the Full Committee and a member of the Privacy, Confidentiality and Security Subcommittee and I have no conflicts.
MS. EVANS: I am Barbara Evans. I am a professor at the University of Houston. I am a member of the Full Committee and the Subcommittee, and I have no conflicts.
DR. RIPPEN: Good morning. My name is Helga Rippen from Health Sciences South Carolina and also Clemson and the University of South Carolina. I am a member of the Full Committee, this committee, the Population committee and I have no conflicts.
MS. KLOSS: Could I ask the staff at the table to introduce themselves.
MS. HINES: Good morning, Rebecca Hines, Executive Secretary of the Committee.
MS. SEEGER: Good morning, Rachel Seeger, Assistant Secretary for Planning and Evaluation.
MS. SANCHEZ: Good morning, Linda Sanchez, Office for Civil Rights, staff to the committee.
MS. KLOSS: Are there any members of the committee on the phone joining us by phone, full committee or subcommittee? (No response) Okay. Then we will run the room and introduce those who are joining us in the audience.
MS. KLOSS: Our goals for this meeting are ambitious and important. Our objectives for this hearing are to increase awareness of current and anticipated practices involving de-identified health information, such as the sale of such information to data brokers and other data mining companies for marketing and/or risk mitigation; to understand HIPAA’s de-identification requirements in light of these practices; and to identify areas where outreach, education, technical assistance and policy change or guidance may be useful.
The National Committee does convene, as we are convening today. Often our work product may consist of a letter to the secretary of HHS on health information policy matters. We envision coming out of this at some point with such a letter with recommendations. What we are doing here today and tomorrow is really listening, learning, asking questions. We are very grateful for our illustrious set of individuals who will come to testify. A special thank-you to Rachel Seeger for her terrific work in pulling together the substance of our hearing.
I think with no further introduction, we would like to welcome Dr. Simson Garfinkel from NIST, and have you provide an overview and help us frame these current issues. Then we will move immediately into panel one. Simson, welcome.
DR. GARFINKEL: Thank you for the opportunity to speak today about the de-identification of personal information and the Health Insurance Portability and Accountability Act. My name is Simson Garfinkel, and I am a computer scientist in the Information Technology Laboratory at the National Institute of Standards and Technology. NIST is a non-regulatory federal agency within the US Department of Commerce. Its mission is to promote US innovation and industrial competitiveness by advancing measurement science, standards and technology in ways that enhance economic security and improve the quality of life.
As part of this work, NIST publishes interagency reports that describe technical research of interest. Last October, NIST published NIST interagency report 8053, de-identification of personal information, which covered the current state of de-identification practice. My statement is drawn from that report and supplemented with additional research that has been performed in the interim.
Today there is a significant and growing interest in the practice of de-identification. Many health care providers wish to share their vast reserves of patient data to enable research and to improve the quality of the product that they deliver. These kinds of data transfers drive the need for meaningful ways to alter the content of released data, such that patient privacy is protected. For example, the Precision Medicine Initiative relies on de-identification as one of the tools for protecting participant privacy.
Under the current HIPAA privacy rule, protected health information can be distributed without restriction provided that the data have been appropriately de-identified, that is provided that identifying information such as names, addresses and phone numbers have been removed. Interest in de-identification extends far beyond health care.
De-identification lets social scientists share de-identified survey results without the need to involve an institutional review board. De-identification lets banks and financial institutions share credit card transactions, while promising customers that their personal information is protected. De-identification lets websites collect information about their visitors and share this information with advertisers, all the while promising we will never share your personal information.
Even governments rely on de-identification to let them publish transaction-level records promoting accountability and transparency, without jeopardizing the privacy of their citizens. But there is a problem with de-identification. We know that there are de-identified datasets in which some of the records can be re-identified. That is, they can be linked back to the original data subject. Sometimes this is because the records were not properly de-identified in the first place. Other times, it is because information in the dataset is distinctive in some way that was not realized at first. This distinctiveness can be used to link the data back to the original identity.
For example, consider a hypothetical case of a researcher who wants to see if laboratory workers at a university are developing cancer at a disproportionately high rate compared to other university employees. That researcher obtains from the university’s insurer a de-identified dataset containing the title, age and five years of diagnostic codes for every university employee.
It would be sad to learn that a 35-year old professor was diagnosed with ICD-10 code C64.1, malignant kidney cancer. But if there are many 35-year old professors, that data element might not be individually identifying. On the other hand, if a code is linked with a 69-year old professor, that person may be uniquely identified. The data would certainly be revealing if the patient’s title is university president instead of professor.
One of the challenges that we face is that the same properly de-identified dataset today may not be properly de-identified tomorrow. This is because the identifiability of data depends in part on the difficulty of linking that dataset with data elsewhere in the global data sphere. Consider a de-identified medical record for a patient somewhere in the US with an ICD code of A98.4. That record can lie in wait like a digital landmine until some website looking for clicks publishes a list of all the people in the US known to have been diagnosed with Ebola. Equipped with this new data, other information in the de-identified data might now fall into place and single out the patient.
To be fair, each of my examples could be considered a case of improper de-identification. The person who prepared the hypothetical university dataset should have looked at the number of university employees with each title and removed the title for those jobs that were held by fewer than a critical number of employees, for example, 10. Perhaps all of the university senior executives should have had their titles in the dataset changed to a generic title, such as senior administrator.
This is an example of generalization, one of several techniques that is used when de-identifying data. Another technique is called swapping, in which attributes are literally swapped between records that are statistically similar. Swapping lets a downstream user of the data perform many statistical operations and generally get the right answer. But it adds uncertainty to the results. In this example with data-swapping, there would be no way for a downstream user to be sure if the university president has cancer, eczema or the common cold.
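[Editorial illustration, not part of the testimony.] The title generalization described above can be sketched in a few lines of Python. The function name, the record layout, and the toy data are hypothetical; the threshold of 10 and the generic title come from the example in the testimony.

```python
from collections import Counter


def generalize_rare_titles(records, min_count=10, generic_title="Senior Administrator"):
    """Replace any job title held by fewer than min_count people with a generic title.

    records is a list of dicts with a "title" key (a hypothetical layout).
    The input list is not modified; new dicts are returned.
    """
    title_counts = Counter(r["title"] for r in records)
    released = []
    for r in records:
        out = dict(r)
        if title_counts[r["title"]] < min_count:
            out["title"] = generic_title
        released.append(out)
    return released


# Toy data: twelve professors and a single university president.
records = [{"title": "Professor", "age": 35 + i} for i in range(12)]
records.append({"title": "University President", "age": 61})

released = generalize_rare_titles(records)
print(Counter(r["title"] for r in released))
```

In this toy data the generic title is still held by only one person, so a real review would re-count the groups after generalizing and suppress or merge further if any group remains below the threshold.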
Other de-identification techniques include adding small random values called noise to parts of the dataset and entirely removing or suppressing specific columns or rows. Technically, we say that techniques like these decrease the potential for identity disclosure in a data release. But they also reduce the data quality of the resulting dataset.
Suppressing columns is the de-identification technique that is easiest to understand. It is also the one that is most widely used. In fact, it is the technique that is specified by the HIPAA Privacy Rule’s Safe Harbor provision. Under the rule, a dataset can be considered de-identified if 18 kinds of identifying information are removed and the entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is the subject of the information.
These identifiers include direct identifiers, such as the person’s name, address, phone number and Social Security number. But they also include so-called indirect identifiers, such as a person’s date of birth, the name of their street or their city. These are called indirect identifiers because they do not directly identify a person, but they can triangulate on an identity if several are combined.
Geographic information requires special attention because, just like job titles, some geographic areas are highly identifying, while others aren’t identifying at all. The Safe Harbor rule resolves this disparity by allowing ZIP codes to be included in de-identified data if there are at least 20,000 people living in a ZIP. Otherwise, only the first three digits of the ZIP code may be included, assuming once again that there are at least 20,000 people living within the so-called ZIP three area.
Other US business sectors have looked at the Safe Harbor rule and sought to create their own versions. After all, the Safe Harbor offers a clear bargain: accept the loss in data quality that comes from removing those 18 data types, and the remaining data are no longer subject to privacy regulation. You can give them to researchers, publish them on the internet or even sell them to a company that will build statistical models. As long as the data provider doesn’t know a specific way that the data can be re-identified, there are no privacy-based limitations on the de-identified data at all.
The problem with this Safe Harbor standard is that it isn’t perfect. We know that there are some people in a dataset that is de-identified according to the Safe Harbor standard who can be re-identified. One reason is because of statistics. It turns out that in some populations, a few people can be identified just from their sex, their ZIP three and their age in years, all of which are allowed under Safe Harbor.
In 2010, the Office of the National Coordinator for Health Information Technology at the US Department of Health and Human Services conducted a test of the Safe Harbor method. As part of that study, researchers were provided with 15,000 hospital admission records belonging to Hispanic individuals from a particular hospital system. The data covered the years 2004 to 2009. Researchers then attempted to match the de-identified records to a commercially-available dataset of 30,000 records from InfoUSA, a company that claims to have data on 235 million US consumers.
Based on the US Census data, the researchers estimated that the 30,000 commercial records covered approximately 5000 of the hospital patients. When the experimenters matched using sex, ZIP three and age, they found 216 records unique in the hospital data, 84 records unique in the InfoUSA data, and 20 records that matched on both sides.
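[Editorial illustration, not part of the testimony.] The matching step described above, linking records that are unique on sex, ZIP three and age in both datasets, might look roughly like the following sketch. The field names and toy records are hypothetical, and this is not the code used in the ONC study.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("sex", "zip3", "age")


def unique_by_key(records, fields=QUASI_IDENTIFIERS):
    """Return {key: record} for quasi-identifier keys that occur exactly once."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in fields)].append(r)
    return {key: rows[0] for key, rows in groups.items() if len(rows) == 1}


def candidate_matches(deidentified, identified):
    """Pair records whose quasi-identifier key is unique in BOTH datasets."""
    a = unique_by_key(deidentified)
    b = unique_by_key(identified)
    return [(a[key], b[key]) for key in sorted(a.keys() & b.keys())]


# Toy data only; names are placeholders.
hospital = [
    {"sex": "F", "zip3": "372", "age": 41, "diagnosis": "C64.1"},
    {"sex": "M", "zip3": "372", "age": 30, "diagnosis": "J45"},
    {"sex": "M", "zip3": "372", "age": 30, "diagnosis": "E11"},
]
commercial = [
    {"sex": "F", "zip3": "372", "age": 41, "name": "Hypothetical Person A"},
    {"sex": "M", "zip3": "372", "age": 30, "name": "Hypothetical Person B"},
]

matches = candidate_matches(hospital, commercial)
print(len(matches))
```

A pair returned here is only a candidate. Neither dataset covers the whole population, so a record that is unique in both may still belong to two different people, which is why the matches have to be checked against ground truth.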
The researchers then examined each of those 20 matches and determined that just two out of the 20 had the same last name, street address and phone number. This represents a re-identification rate of 0.013 percent. The researchers also calculated the re-identification risk of 0.22 percent using a more conservative methodology. These rates are not a nationwide average, however, since they are based on a single ethnic population in a single health care system.
This example embodies much of the way that de-identification is performed in the worlds of health care, finance and official statistics today. First, direct identifiers are removed from the dataset. Next, indirect identifiers are identified and analyzed to make sure that any specific combination of identifiers ambiguously identifies at least a certain number of individuals.
This number is sometimes called K, a reference to Latanya Sweeney’s K-anonymity model. K-anonymity is not an algorithm that de-identifies data. Instead, it is a framework for measuring the ambiguity of records in a released dataset. Many practitioners will use this number to calculate a re-identification rate. If no fewer than 20 records have the same constellation of indirect identifiers, then they say that the re-identification rate is 5 percent, 1 out of 20, meaning that an attacker trying to match an identity to one of those ambiguous records in the dataset has a 5 percent chance of being right.
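[Editorial illustration, not part of the testimony.] Measuring K for a dataset and converting it to the 1-out-of-K rate described above can be sketched as follows. The function names and toy records are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return K: the size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("cannot measure K on an empty dataset")
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


def worst_case_reidentification_rate(k):
    """Chance that a guess within the smallest group is right, assuming a uniform guess."""
    return 1.0 / k


# Toy data: one group of 20 records and one group of 25 records.
records = (
    [{"sex": "F", "zip3": "372", "age": 41} for _ in range(20)]
    + [{"sex": "M", "zip3": "372", "age": 30} for _ in range(25)]
)

k = k_anonymity(records, ("sex", "zip3", "age"))
rate = worst_case_reidentification_rate(k)
print(k, rate)
```

This only measures a dataset. Raising K still requires generalization, suppression or another transformation applied beforehand.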
Tiger teams are another way to measure the effectiveness of de-identification effort. The idea is to give the data set to a tiger team and have this team try to re-identify the data subjects. That is what was done in the ONC study. This approach is effective because the study coordinators know the ground truth. They know which of the matched records are actually matched individuals, rather than simply being recognized as unique between the two datasets.
Many of the well-publicized academic re-identification attacks in recent years have not taken this extra step of verifying their matches. Instead, they have asserted that dataset uniques prove that the dataset can be re-identified. The ONC study shows that attempts to characterize re-identification rates using dataset uniques rather than verifying against the ground truth may significantly overestimate the re-identification rate.
In the ONC case, the re-identification rate would have been overestimated by a factor of 10. That is because they found 20 dataset uniques between the two. But when they did the test and looked at the ground truth, they only saw that two of them were actual matches.
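[Editorial illustration, not part of the testimony.] The arithmetic behind the figures quoted above can be checked directly. The numbers are taken from the testimony; the variable names are ours.

```python
hospital_records = 15_000   # de-identified admission records in the study
dataset_uniques = 20        # records unique on sex, ZIP three and age in both datasets
verified_matches = 2        # uniques confirmed against the ground truth

verified_rate = verified_matches / hospital_records
uniques_rate = dataset_uniques / hospital_records
overestimate_factor = dataset_uniques / verified_matches

print(f"verified re-identification rate: {verified_rate:.3%}")
print(f"rate implied by dataset uniques: {uniques_rate:.3%}")
print(f"overestimate factor: {overestimate_factor:g}")
```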
Another problem with re-identification tests such as this is that they are inherently based upon assumptions about what data are available for matching. However, as more data both identified and identifiable become available, those assumptions may no longer hold. For example, web searches may become available that allow the matching. Or there may be information in social media that allows matching. That might not be considered by the original re-identification calculation.
As data are more widely used and distributed within our society, and as we learn better how to tease identifiable information out of what was previously thought to be unidentifiable data, we will need de-identification techniques that provide stronger, more measurable privacy guarantees than those that are provided by Safe Harbor. That is because there may be other data elements that are not considered identifying by data scientists, but which might be identifiable to a friend or a relative or a neighbor.
I have an example here. Consider a person who has a constellation of several diseases or accidents over the course of several years. There may be only one person who has that particular specific combination. That might make it possible to identify that person in the dataset. The de-identified records might be linked to other information, which would provide additional information about the person that wasn’t known. This is likely true not just for people who have suffered from rare disorders, but for many members of the population.
In 2012, researchers at Vanderbilt University and the University of Texas, including Brad Malin, who is on today’s first panel, showed that an attacker who obtains five to seven laboratory results from a single patient can use those results as a search key to find matching records in a de-identified biomedical research database. Medical test results aren’t classified as indirect identifiers. But it turns out that each of those individual results in a CBC or CHEM-7 test has enough variation taken together that they can directly identify a person.
Of course, you can’t identify a person today based on a CHEM-7 test taken a year ago. The threat model is much simpler. Each medical test is unique. If all of the tests taken by the same person are given the same pseudonym in that research database, and somewhere else there is a test result that doesn’t have the pseudonym, but has the patient’s real name, then that de-identified test result can be used as a key into the research database to find the matching record and to re-identify the ones that are linked. Unfortunately, these databases of medical histories and treatments are precisely the kinds of databases that need to be created and made widely available for projects like the Precision Medicine Initiative to succeed.
One solution, as outlined in the 2012 paper, is to add noise to the non-identifying values before the data are released just to make sure that they can’t be linked with another dataset. But many clinicians and medical researchers are apprehensive about the idea of adding noise. They are afraid that the noise may result in incorrect conclusions being drawn from the data.
But the 2012 paper showed that it is possible to add noise in a way that the clinical meaning of the laboratory test remains the same. The question is how much noise should be added. Like the value of K or the re-identification rate, that is a policy question, not a technology question. More noise will lower both the data quality and the identifiability of the resulting data.
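[Editorial illustration, not part of the testimony.] One simple way to add noise while keeping a result on the same side of its reference range is sketched below. This is not the method of the 2012 paper; the function names, the rejection-sampling approach, the noise bound and the example reference range are all assumptions made for illustration.

```python
import random


def clinical_category(value, low, high):
    """Classify a lab value against a reference range [low, high]."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def perturb_lab_value(value, low, high, max_noise, rng, max_tries=100):
    """Add uniform noise in [-max_noise, max_noise] without changing the category.

    Candidates that would cross a reference-range boundary are rejected and
    redrawn. If no acceptable candidate is found, the original value is
    returned unchanged (a fallback chosen for this sketch only).
    """
    original = clinical_category(value, low, high)
    for _ in range(max_tries):
        candidate = value + rng.uniform(-max_noise, max_noise)
        if clinical_category(candidate, low, high) == original:
            return candidate
    return value


rng = random.Random(0)  # seeded only so the sketch is repeatable
# Illustrative range for a potassium result in mmol/L; not clinical guidance.
noisy = perturb_lab_value(4.9, low=3.5, high=5.0, max_noise=0.2, rng=rng)
print(clinical_category(noisy, 3.5, 5.0))
```

The size of `max_noise` is the policy choice the testimony refers to: a larger bound makes linkage harder and the released values less precise.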
Differential privacy is a framework that describes the relationship between the amount of noise that is added and the amount of privacy protection that results. One of the intellectually-attractive aspects of differential privacy is that its privacy guarantee exists independent of any past or future data releases.
Unfortunately, this guarantee comes at a heavy cost to data quality, especially if the noise that is added is added in an unsophisticated way. Differential privacy was developed to support query systems. The basic idea was to allow researchers to perform statistical computations without having the system link personal information about any of the individuals in the database.
These sorts of query systems, sometimes called trusted enclaves, are in use today, although rarely with the mathematical formalisms and guarantees that differential privacy provides. One of the early discoveries in differential privacy is that there is a quantifiable privacy budget that the query systems have. Only a certain number of questions can be answered because answering more questions inevitably compromises more privacy.
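[Editorial illustration, not part of the testimony.] A query system with a privacy budget can be sketched with the Laplace mechanism for counting queries and simple sequential composition, where each query spends part of a fixed epsilon. The class name, its methods and the toy data are hypothetical, and a production system would need far more care than this.

```python
import random


class PrivateCounter:
    """Answers counting queries with Laplace noise until the budget is spent."""

    def __init__(self, records, total_epsilon, rng=None):
        self._records = list(records)
        self._remaining = total_epsilon
        self._rng = rng or random.Random()

    def remaining_budget(self):
        return self._remaining

    def count(self, predicate, epsilon):
        """Return a noisy count of records matching predicate, spending epsilon."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise RuntimeError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # A counting query changes by at most 1 when one person is added or
        # removed, so the Laplace scale is 1 / epsilon. The difference of two
        # exponential draws with rate epsilon is Laplace with that scale.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


# Toy data: 100 people aged 0 to 99, with a total budget of epsilon = 1.0.
system = PrivateCounter([{"age": a} for a in range(100)], total_epsilon=1.0)
print(system.count(lambda r: r["age"] >= 65, epsilon=0.5))
print(system.remaining_budget())
```

Smaller epsilon values give noisier answers but allow more queries before the budget runs out, which is the trade-off described in the testimony.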
Several approaches have been developed to minimize the impact of this privacy budget. For example, instead of answering questions, the entire privacy budget can be spent publishing a new synthetic dataset. Such a dataset can be essentially thought of as the original dataset with a few columns dropped and swapped or altered to protect privacy. Or it can be a wholly artificial dataset that is statistically faithful to the original dataset, but for which there is no one-to-one mapping between any individual in the original dataset and data that are made publicly available.
This approach is being used by the Census Bureau today. It has a synthetic dataset for its Survey of Income and Program Participation. According to the Bureau’s website, the purpose of the SIPP is to provide access to linked data that are usually not publicly available due to confidentiality concerns.
What this means is that for every person in the original dataset, there is a corresponding person in the synthetic dataset. But only the gender and a link to the individual’s first reported marital partner are unaltered by the synthesis process. All the rest of the information has been changed. Re-identification is not unlikely. It is impossible since the synthetic people are statistical combinations of real people.
Synthetic and artificial datasets, and I am quite ahead, pose a challenge to researchers and to the general public. A synthetic dataset designed to allow research on hospital accidents nationwide might let researchers draw accurate generalizable conclusions about the impact of training and doctors’ work hours on patient outcomes. But it might make it mathematically impossible to identify specific patients, doctors or hospitals.
Such a dataset would be useless for the purpose of accountability or transparency because it couldn’t identify specific doctors or hospitals. But it would certainly protect the patient privacy. Whether this kind of approach is one that should be used in health care, again, that is a policy question, not a technical question.
Once the de-identification strategy and mechanisms are decided upon, they need to be formally evaluated. One of the evaluation questions that you might have is whether the de-identification strategy meets the stated policy goals. Is the software that implements that strategy faithful to the algorithm? You might have software that you think implements K-anonymity, but it might have bugs. Or it might actually be implementing a completely different algorithm altogether.
It might be that the statistical privacy guarantees promised by the software are not actually met. Again, it might have a bug. It might be implementing that algorithm, but not in the correct manner. It might be that the software does what it is supposed to do, but there are reliability problems, such that when the same person uses the same software, because of usability errors in the user interface or because of other aspects in the way the system is being used, you don’t get the same results each time.
It might be that there is the use of randomness inside the software. Sometimes it generates the correct answers. Sometimes it generates answers that are off. It might be that institutions conducting the de-identification don’t have the necessary training and procedures in place, so that the software isn’t run reliably because of human error. It might be that you need to do ongoing monitoring and auditing to make sure that the assumptions about the de-identification continue to hold. You have to make sure that the institutions have that kind of monitoring in place.
Finally, I would like to briefly mention the de-identification of non-tabular data. This is a significant challenge. Free-format text, photographs, video and genetic sequences can all contain information that is highly-identifiable. There is nevertheless a need to be able to legally de-identify these data and share them without restrictions. They need the same sort of bargain that the HIPAA Privacy Rule provides.
One approach for sharing these kinds of data is technical controls, such as data-sharing agreements or legal penalties for re-identification attempts. Another approach is the use of synthetic data. More research is needed to determine if systems could be developed that protect privacy and allow unlimited use of data. Other research is needed to determine a process that can transform raw data so completely that individuals cannot recognize their own data once they are in a crowd. This would solve the especially difficult problem of preventing re-identification of data elements by close friends and family members.
Techniques that prevent science must also maintain data quality. Synthetic data may be the only way to accomplish this goal. Thank you very much. I guess I have a few minutes. In summary, we have learned a lot about de-identification in recent years. There is this de-identification toolkit. In health care, we are commonly using suppression and generalization. Field swapping and noise addition are commonly being used in vital statistics, but not so much in health care.
We have these formal models called K-anonymity and differential privacy for evaluating the quality of de-identification. We increasingly have the ability to modify data, so that the data subject’s identity is removed, leaving information that is somewhat useful. The more useful the data is, the more likely it can be re-identified.
What we need is procedures for evaluating the effectiveness of de-identification and evaluating the usefulness of the data that remain. We need these techniques that work on a wide range of data, structured data, text, medical records and video. Thank you again.
MS. KLOSS: Could we leave this slide up, please? I think it will trigger certainly some questions. We would like to, I think, go next to questions for you, if that is okay. First, I would like Bob to introduce yourself as a member of the committee who came in after we did the introductions.
DR. PHILLIPS: Bob Phillips with the American Board of Family Medicine. Thank you. No conflicts.
MS. BERNSTEIN: Maya Bernstein. I am usually the lead staff to the subcommittee, but I have been on detail. The lead staff today is Rachel Seeger.
MS. KLOSS: Okay. I think we are open for questions. Thank you so much. Anyone want to lead off?
MS. MILAM: I think you highlighted the richness of the topic and the complexity. As I think about states and local government and communities, I am wondering what advice you might give them given the importance of the use of the data and the real need to improve health care. How organizations, especially smaller governments with fewer resources might approach making data available on the web given all of these issues.
DR. GARFINKEL: Thank you for that question. That is a broad question of public interest today, not just in health care but in the President’s Police Data Initiative. It is also an issue for private entities that want to make data widely available.
Right now, there are efforts to standardize curricula for de-identification practitioners. There are efforts to add information about de-identification to master’s programs and to provide training that is available online. Unfortunately, every dataset seems to have different de-identification challenges. In one dataset, there may be errors in the coding, such that even though you think that particular identifying information is only in one column, it may be in other columns, as well. The result is that some of the records may be identifiable when the dataset is publicly released.
So to answer your question, what is needed is people who are trained in this area, and whether those small organizations that have data bring in consultants or they try to staff up. But right now, the advice that I have is to go slow and understand that once data are released, it is very difficult to get them back. More data can always be released. But right now, we are in the beginning. It is really important to be careful.
DR. MAYS: I am wondering if you are asking for two things. I just want to get clear about it. Are you asking that we have better science, like statistical and methodological approaches, that will help us to be able to identify ways in which we might accidentally re-identify? Secondly, are you asking for standards that should go with the dataset that would tell a person, for example, if you have swapped or hot-decked or done something which may cause the data to have less accuracy when we do our analyses?
DR. GARFINKEL: So, right now we have science that is much better than the practice. There has been a lot of work in this area over the past 10 years. My colleagues here have published excellent academic articles on this topic. A lot of that technology is not being used in the field. There is a need for better science. But there is also a need to innovate and actually move, to transition that science from the Academy to practice.
Now, your second question, though, is a question of data provenance. That is, it would be useful to have standard ways of labeling datasets, so that the users of those datasets knew the kinds of transformations that they had experienced. I know of another organization that has a mechanism where they can run queries on the real data and run queries on the de-identified data, and report if they are significantly different.
Now, people who work in the area of differential privacy will tell you that this leaks information about the real data. But that is a measurable information leak. It is a useful practice to do. It is useful to be able to run queries on both real data and on the de-identified data. That is a model that should be explored, as well.
DR. SUAREZ: Great presentation. I have two types of questions. One is actually from your last statement about the fact that science is much farther than practice in this respect. It seems like then the question becomes how can we accelerate a practice of the science?
In some ways, the industry itself needs to find the appropriate ways to ensure that the science is being applied. In other words, you need the right incentives. You need the right mechanisms to have the industry begin to apply the science much more rigorously, I guess. I know this is more of a policy question: what are the right incentives in health policy that would drive an increase in the use of science to apply to the practice of de-identification. There probably are a number of them.
But it seems that the question becomes really what would be some of those policy drivers that would push for that science to be more rigorously applied to the practice in the health care industry. What might be some of the policy drivers? That might be one of the areas we need to discuss throughout this hearing. It seems we have all this science. We have all this technique, all these methodologies. Some quite complex, as you have noted in your remarks. It seems like in order to make sure that science is applied, we need to have in the industry some layer of incentives. That is the first question, what do you see as possible incentives.
Then the second question is more about guidance and whether NIST, as an organization that offers a lot of guidance around many areas, has done or would be an entity that could provide some guidance to understand better how to apply all the science into the practice.
DR. GARFINKEL: That is a big question. I haven’t done research on what incentives would be useful. I don’t think I can answer that question. As far as the guidance that NIST could apply, we are working on a number of projects in that area. Hopefully we will have something out in the future.
DR. RIPPEN: Again, thank you so much. Very informative and very on topic, but then I guess I am biased. Thank you. When you talk about the synthetic tools or the approach, I guess the first question is how difficult is it. Can one provide tools and guidance that could help them be implemented? As you know, data scientists and statisticians are becoming harder to find. The question of how does one facilitate good practices in a way that can be disseminated is very important.
Then the second question really revolves around something that you highlighted, which is you can create noise within certain datasets. But then you might lose it as it relates to the health care provider and things like that. So could one have best practices where, if you know what the intent is, you could put out different versions of synthetic data that would help address certain questions without actually increasing additional opportunities for recombining. That is the second question.
DR. GARFINKEL: So, the first question you asked me is how difficult are tools that make synthetic data. Such tools are not widely available at the present time. There are several tools. We are conducting an evaluation this summer of one of them. There is a number of research tools that are available. They could be maybe easier to use. There are some commercial tools in that area, also.
Regarding creating multiple datasets, so when a de-identification is performed, it is best to have an anticipated use of the dataset because different fields can be made more faithful or less faithful. Clearly, if you bring out multiple datasets, there might be a possibility for correlation. I don’t know the math, but I know people who do know the math. It is complicated. I have always learned not to engineer solutions when speaking off the cuff. I can’t go further than that.
MS. MILAM: I have a follow-up question to Walter’s question. I was thinking about technical assistance to states. As Walter was talking, I was thinking about the MONAHRQ product that AHRQ offers states. I am wondering if it is possible for HHS or others, probably HHS, to offer such technical toolkits to stand up data web query systems with your common datasets.
Now, certainly there are a lot of one-off datasets, but there are a lot of datasets that are common to all states. Is it possible? Is it a good idea to have these websites, these web query systems out of the box that states and locals can apply that have appropriate de-ID built in?
MS. BERNSTEIN: Can you explain for those of us ignorant what the MONAHRQ thing you just mentioned is?
MS. MILAM: AHRQ provides, and I am sure others can explain this better than I can, it is a website product that states can utilize. A lot of states have data out on MONAHRQ. You can search by population or by hospital. You can see rates. You can see information about different disease states. You can search on any number of factors, but you put hospital discharge data into it.
You utilize the MONAHRQ product. Now, there are choices within that product that go from more aggregate to more granular. You select those choices based on your policy. I am wondering if there is a way to shape tools for states, so that they can get the data out there where it is needed, but do it in a way that it is productive. Would there be any benefit to that?
DR. GARFINKEL: That is a question that I really shouldn’t answer because I am not familiar with the specifics. There might be data aggregation issues that might cause problems or that you could leverage. I would have to research that in detail. Clearly, there are many models that software and data release and data warehousing could be combined productively. I am sorry.
MR. COUSSOULE: I have a different question that is related. I am sure we will get into this more over the next couple of days. Most of what we are talking about are the techniques to use the data, kind of what I will call for the greater good. Whether it is research or whether it is publication, et cetera. How do the policies in regards to individual privacy and individuals’ decisions of how their data gets used either help or hinder that process?
DR. GARFINKEL: Are you asking how these policies affect the willingness of individuals that participate in the data economy?
MR. COUSSOULE: I am just trying to think through some of the implications that if we look at tools and techniques to be able to de-identify the data to use for any number of different reasons, whether it be directed research or more generalized research. Then we weigh that off against individuals who sit back and say, I don’t want my data used for anything outside of that. How do you deal with reconciling those from a policy perspective to still create the utility that you are trying to get out of it? And yet, give people the ability to say, wait a second. I am not sure I want my data anywhere.
DR. GARFINKEL: So, some organizations have adopted policies allowing individuals to suppress the release of their information. That can cause systematic bias in released datasets. Again, this is a policy question. I can’t tell you the right answer.
One of the reasons that NIST is engaged in privacy research is if people feel more comfortable that their privacy is protected in these data releases, they will be more willing to participate in the data economy. There is a belief that we have to solve. This is not a NIST belief. This is a general belief that we have to get the privacy right in order to preserve the use of data for research.
MS. KLOSS: We have time for one more question. The report that you released on the current state of de-identification was a terrific report. It was shared with our subcommittee. What is being done with that? Is it being disseminated? Are you taking it further? Is that a jumping off point for additional research? How will that great work be leveraged?
DR. GARFINKEL: We are doing work this summer on de-identification. We have a pilot study going on. We hope to be able to evaluate de-identification software. We are, in fact, looking for organizations that want to participate with that.
MS. KLOSS: The timing on that project to evaluate de-identification software?
DR. GARFINKEL: We have a pilot project going on right now.
MS. KLOSS: Any other NIST agendas that will flow from this work?
DR. GARFINKEL: I am really not at liberty to speak about what is going to be happening in the future.
MS. KLOSS: We are just wondering how we should continue to learn from the work that you are doing.
(Audio goes out for one minute)
MS. KLOSS: Ira, are you on the phone? Are you on mute? (No response). Okay, we will move on.
MS. KLOSS: Brad Malin. We will check in with him as we go along. We will ask Brad if he will step in. Thank you.
DR. MALIN: So, thanks a lot for the invitation to be here today. This is a topic that is quite dear to my heart, as you can imagine. I have given you written testimony. I would encourage you to use that as a supplement to what I am going to show you.
The presentation that I have prepared is directly in line and has the same progression as what is within the statement. Most of the references that might be of use to you are going to be documented within there.
For those of you who are not familiar with me, I am currently an Associate Professor of Biomedical Informatics and Computer Science at the university.
MS. KLOSS: I think Brad is teed up to begin. If it is convenient with you, Ira, we will proceed. Okay.
DR. MALIN: So I am not Ira, but I will continue to go as Brad Malin. As I was saying, I am at Vanderbilt University. I am also the vice chair for research of biomedical informatics at the Vanderbilt University Medical Center. I run a health data science center within the university.
I will state several conflicts of interest to begin. First, between 2009 and 2012, as Linda Sanches knows, I was a paid consultant to the Office for Civil Rights at HHS. I was involved in the development of the HIPAA de-identification guidance that I have been asked to provide comments on today. From 2006 to the current moment, I have been a consultant to a number of companies, as well as NGOs and universities, with respect to de-identification. If you want the full list, I can provide that to you afterwards.
I did want to begin by saying from the outset that I believe that what you are doing in doing an investigation into the guidance and where we stand today currently with de-identification is important. At the same time, I wanted to point out that I do believe that at the present moment in time, de-identification, as we are currently performing it, may actually be safe.
This is an artifact of a study from a couple of years ago that I did with Khaled El Emam. In 2011, we reviewed all known actual re-identification attacks, specifically attacks on health data. At that time, there were 14 published re-identification attacks on any types of data. I will note this number has really not changed that much in the last couple of years.
Of these 14, 11 were conducted by people like me as demonstration attacks, mainly academicians. Ten of the 14 attacks actually verified the results. Only two of those attacks actually followed any standard. Of those two, the average rate of re-identification was, as Simson was alluding to, on the order of .013 percent.
So Simson went through the details of this study. I will note that it was brought about by HHS. These were 15,000 records. It was InfoUSA that was the commercial broker. It led to the correct identification of two people. You have already heard the story, so I am going to skip over it.
We were asked several questions today. I wanted to focus particularly on those questions. Then we can jump off from there. The first one was what issues do you see as being the most pressing when you consider de-identification in HIPAA? The first most pressing issue I see is that if you look at this Safe Harbor model that has been put into practice, which is remove 18 identifiers and then move forward, this has the potential to become a very dangerous practice because it allows for an unlimited feature set.
You can have an individual’s race. You can have their ethnicity. You can disclose their sex, their gender. You can disclose sex and gender, the number of marriages that they have had, what job they have, what education level they have, socioeconomic status. How many children do they have? What is their primary language spoken? What is their religion? What is the number of children who are college dropouts? Who is their insurer? It can go on and on and on. There is nothing that precludes anything that is listed up here.
This is a concern. This list can be infinite in length. But where I start to really get concerned is that this is a compounding problem. You can say, for instance, that we have one person and they are a female and they are married to another female. They have a child who is a female. Actually, they have three children. One is a male, one is a female, one is a male. We know all of these things about these people because those links exist, and there is nothing that prevents that from occurring in Safe Harbor.
We can go further and say that one person or this family has the senior generation living with them in their house. These individuals are married. One of them might actually be deceased, for instance. All of this can be disclosed. These things become unique and distinguishing pretty quickly. That is a concern that I have.
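The compounding effect described above can be sketched with a small count of how many records become unique as more permitted attributes are combined. This is an illustrative sketch only: the records, the field names, and the helper `fraction_unique` are invented for the example and do not come from the testimony or from any real dataset.

```python
from collections import Counter

def fraction_unique(records, fields):
    """Share of records whose combination of values on `fields` is unique."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)

# Hypothetical records containing only attributes that Safe Harbor
# does not list among its 18 identifiers.
records = [
    {"sex": "F", "marital": "married", "children": 3, "language": "English"},
    {"sex": "F", "marital": "married", "children": 1, "language": "English"},
    {"sex": "F", "marital": "married", "children": 1, "language": "Spanish"},
    {"sex": "M", "marital": "single", "children": 0, "language": "English"},
    {"sex": "M", "marital": "single", "children": 0, "language": "English"},
    {"sex": "F", "marital": "single", "children": 0, "language": "English"},
]

fields = ["sex", "marital", "children", "language"]
# Uniqueness when using the first 1, 2, 3, then all 4 fields.
growth = [fraction_unique(records, fields[:n]) for n in range(1, len(fields) + 1)]
for n, share in enumerate(growth, start=1):
    print(fields[:n], round(share, 2))
```

In this toy data no record is unique on sex alone, while two thirds are unique once all four fields are combined. Adding family links of the kind described above would push that share higher still.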
The second concern, so this is a natural language statement. This is something that you might come across in a medical record. It says this patient has a really interesting history. Oh, by the way, they are married to the deacon of the local church. The problem is that, well, there is only one church in that region with a deacon. But who knew, right? Maybe it wasn’t documented very well. Churches come up, churches go down, people are hired, people are fired, and there is only one deacon. But we didn’t know this.
There is nothing explicitly identifying in there. But these types of statements, which are becoming extremely common in electronic medical record systems, but moreover in like patient portals where you have patients just conversing with their clinicians or nurses or any type of a care provider who is assisting them. This is becoming the norm where it is just these long conversations. There is nothing that actually stops this from being disclosed in the context of Safe Harbor. That again is a concern to me.
Let’s flip the perspective a little bit and go away from Safe Harbor and go to the other implementation that has been given to us as the de-identification experts, which is what you call this expert determination or the statistical standard. When you do an expert determination, you are supposed to take an anticipated recipient into account. That becomes part of your adversarial threat. This is a wonderful concept. It really indicates that you can create protection models that are proportional to the person you are giving or the organization that you are giving the data. This is a fantastic notion.
But who exactly the anticipated recipient is becomes a very tricky thing. For instance, I am at Vanderbilt University, and my anticipated recipient may be Duke. I give them my de-identified data. Duke then has a security breach. All of the data ends up getting exposed, at which point anybody can start downloading this data. Or it can end up on BitTorrent, and then it may live in perpetuity, which is exactly what happened to the AOL dataset that has been alluded to in other types of situations. We can talk about that.
Now I have this unanticipated recipient. While technically security breaches happen, should I have accounted for them as an anticipated recipient? It could be anybody that is the anticipated recipient. Now that this information is exposed, anybody can download it.
One way that experts are trying to deal with this is they try to characterize the probabilities of these events. The anticipated recipient will have a certain chance of being able to identify somebody given they are subject to some kind of a contractual agreement. But then there is a certain probability of this leakage. It might be very small.
But when it happens, these people who are not necessarily subject to some kind of contractual agreement may actually just hammer on the data until they are correct. This is an open question of how we are supposed to resolve this when experts go out there and do certifications.
One other issue in the context of this first question that was posed to us is about this notion of OMICs data, which is as high-dimensional as you can possibly imagine. It is as detailed as the fidelity of the tool we used to generate it. What I am showing you here, this is 500 single nucleotide polymorphisms that we have derived from DNA. It is not real, as Simson asked me beforehand.
I would guarantee that if these are normal variants, variants that would be of interest for investigative study, you would only need about 60 or 70 of these in combination to uniquely represent an individual. This is research that Russ Altman’s group did over 10 years ago.
But from an identifiability perspective, you have to ask the question of who is this. Just because it is uniquely distinguishing doesn’t necessarily mean that I have a database of named DNA sequences that anybody in the world is just going to be able to query. It doesn’t mean it doesn’t exist. It doesn’t mean that you have access to it, though.
I am going to go on record and say I think it is premature to designate DNA as an identifier or any OMICs data as an identifier. Ellen Clayton and I have been doing a lot of research into this issue. The references have been provided to you. This is also an issue that OHRP is addressing or investigating at the present moment in time as they are considering their revision to the Common Rule. I don’t think that it is something that should be addressed specifically in the context of de-identification at the present moment.
Okay, so that was question one. Question two, is the current HIPAA de-identification guidance sufficient, and does it pose challenges to, or does it advance the use of data in health care? One thing that should be made clear is that the regs say that you do a risk assessment, and you ensure that there is a small risk that somebody could be uniquely identified.
Risk means something different to everyone. This is a real challenge because experts don’t agree on what is the right risk definition. This little Venn diagram that I have got out on the right, imagine that these little fs are groups of people who show up in a medical center. These are all the 20-year-old Asian males who showed up at Vanderbilt.
One type of a risk measure says that your data set is as risky as the smallest group. You have got one group of only one person. You have got 100 percent risk in your system. The other perspective is that, well, that is a sample that was taken from a larger population of all of the 20-year-old males who could have shown up at Vanderbilt. They lived in that region. Some people say that risk, which is what people call journalist risk, is that, well, they didn’t know they were in your dataset. So the risk is proportional to the smallest of those big Fs.
Other people say we don’t know who is going to be targeted in a dataset. Instead, we say that we look at the average risk across everybody in the system. They are all at potential risk of exposure. Instead of saying that the risk is one, we might say that it is the average of these people, whatever that may turn out to be. It will be smaller than one. This is still an open question of which of these risk measures you are supposed to use.
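The three risk perspectives just described can be put side by side in a short sketch. It assumes the common convention that a record in a group of size f carries a risk of 1/f; the function names, group labels, and counts are hypothetical and are not taken from the slides or from any specific published definition.

```python
def sample_max_risk(sample_sizes):
    """Risk set by the smallest group in the released sample (little f)."""
    return 1 / min(sample_sizes.values())

def journalist_risk(sample_sizes, population_sizes):
    """Risk set by the smallest matching group in the population (big F)."""
    return 1 / min(population_sizes[g] for g in sample_sizes)

def average_risk(sample_sizes):
    """Mean of 1/f over every record in the sample."""
    total_records = sum(sample_sizes.values())
    # Each group of size f contributes f records at risk 1/f, i.e. 1 per group.
    return len(sample_sizes) / total_records

# Hypothetical group sizes: f in the released sample, F in the region.
f = {"group_a": 1, "group_b": 4, "group_c": 5}
F = {"group_a": 10, "group_b": 40, "group_c": 50}

print(sample_max_risk(f))      # one group of one person
print(journalist_risk(f, F))   # smallest population group has 10 people
print(average_risk(f))         # 3 groups over 10 records
```

On the same data the three measures give 1.0, 0.1, and 0.3, which is the disagreement among experts that the testimony points to.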
More recently, including some of our own research, we have been looking at an adversarial perspective in terms of what drives somebody to actually commit a re-identification. Why do they actually want to commit maleficence in this system? The expectation is that data has worth. If it has no worth, then there is absolutely no reason to do something like this.
If you have a bunch of different records, somebody is going to go for the one that they have the greatest chance of making a lot of money on and the smallest chance of being detected for doing anything against it. The challenge in all of this is that this turns into a pricing structure. Everybody would have to come to some type of an agreement of how do you value the data in a system. Are HIV patients more worthy than people who have broken left toenails? If the person with the broken left toenail is our president, that might actually be something that somebody is interested in. That is a challenge.
The third question is what are the points of confusion or challenges related to HIPAA? What are options for resolving these? I have given you a lot of challenges so far. But let me give you a couple more just to think about.
The first one is this notion of a time-limited determination. Simpson was alluding to this, as well, in that he was saying that over time, more data gets out there. There is a potential for it to lead to exposure of existing data that is in the market. Experts use time limitations to say, you can have this information under a de-identification determination for a year or 18 months or two years because that is the expectation of how long it is going to be protected. After which time, it is no longer subject to the determination, and you have to get it recertified.
It is unclear what is the appropriate amount of time. There are chances that if you have the wrong expert working with an organization trying to share data, they might say that determination is good for a day or a week. By the way, you need to come back and pay me again because I am the one who is doing your certification. I worry about exploitation in the system with respect to these types of determinations.
You then have to ask questions of what happens when that limitation expires. How do you guarantee that the organization that you provided that data to actually destroys it. Can you go in there and run an audit to ensure that they have done so? There is a cost associated with that.
There are also questions of could you continue to use data that has already been shared. If a research study is being conducted, and the study has not been completed yet, and the determination expires, what happens at that point?
What happens if you want to re-up your certification, and the original certifier is no longer available? Who do you go to in order to try to continue to use that data? That is all of the questions.
One of the largest challenges I have seen over the last decade is you are seeing this amalgamation of databases into uber databases. You have different certifiers who have been involved in the initial determinations. You may have a dataset A. It has its own determination. You have a dataset B. It has its own determination. Now this company B says, I want to pull data from A. I want to integrate it. I want it to still be de-identified.
You have to ask who is responsible for actually assessing if this is still the case. Can an expert from this group make that determination when it is shared over to group B? Or is it the receiver of the data who is actually requesting it to make that determination?
Is it both A and B’s responsibility? Are they jointly liable? I don’t know. I don’t know if that is something that even has to play out in court. It would be great if there was some type of guidance that was set forth beforehand, before we got into that type of situation.
There are open questions about this notion of noise and perturbation in general in the system. Safe Harbor states, for instance, that dates must be no more specific than one year. But there is a question of could you shift dates and still meet this requirement.
George Hripcsak and I wrote a paper on this, and it actually came out three months ago, that talked about what we think is a fair shifting strategy that would meet this type of a requirement. It is a shift and truncate model. The way it works is you shift information into the future, and you truncate if it is in a region that hasn’t actually come to fruition yet. I won’t go through the details of this, but it is a method. It exists.
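As a rough illustration of the idea of shifting dates into the future and truncating what falls beyond the present, here is a minimal sketch. It is not the published method, whose details are not given in this testimony. The fixed `shift_days` offset, the `cutoff` date, and the event list are assumptions made for the example; in practice the offset would presumably be chosen per patient and kept secret.

```python
from datetime import date, timedelta

def shift_and_truncate(events, shift_days, cutoff):
    """Shift every event forward by the same offset, then drop any event
    whose shifted date would fall after `cutoff`.

    A single offset for all of one patient's events keeps the intervals
    between those events unchanged.
    """
    shifted = []
    for event_date, label in events:
        new_date = event_date + timedelta(days=shift_days)
        if new_date <= cutoff:
            shifted.append((new_date, label))
    return shifted

# Hypothetical events for one patient.
events = [
    (date(2015, 1, 10), "admission"),
    (date(2015, 1, 14), "discharge"),
    (date(2016, 3, 1), "follow-up"),
]

result = shift_and_truncate(events, shift_days=120, cutoff=date(2016, 5, 24))
for shifted_date, label in result:
    print(shifted_date.isoformat(), label)
```

In this toy run the four-day gap between admission and discharge survives the shift, while the follow-up visit is dropped because its shifted date would land after the cutoff.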
There are challenges, though, with this method. For instance, you could end up with semantic break. What happens if you show that a drug has been released before the date it has been reported at that point? Somebody goes, oh, well, that is not possible. That had to be a very different date. There is a problem with the method at that point that may or may not be accounted for.
Recently, there have also been proposals for doing what I would call text-based steganography, or what other people call hiding in plain sight, for natural language text in particular. This is where about 80 percent of all of the medical information currently lives. It is not all structured. It is in the communications, as I was alluding to.
Here is an example where you have all of the personal identifiers listed in this document. If you use the state of the art with redaction techniques, even machine-learning based approaches, for trying to find all the identifiers and redact them, some leak through. It is almost always the case. It is impossible to get this up to 100 percent.
The work that we have been doing with David Carrell’s group at Group Health, which I guess is now part of Kaiser, is we work on reconstitution of things that have been suppressed with fake information. We have done evaluations of this, both with humans, as well as machine-based attacks, to show that you can’t really compromise the system in a way that is better than random.
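The hiding-in-plain-sight idea can be sketched as follows: identifiers that a detector finds are replaced with realistic fake values, so that any identifier the detector missed sits among surrogates and is hard to tell apart from them. This is only an illustrative sketch. The surrogate pool, the sample note, and the list of detected names are invented, and the detector itself is assumed rather than implemented.

```python
import random
import re

# Hypothetical pool of surrogate names; a real system would draw from a
# much larger and more realistic source.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Pat Quinn"]

def hide_in_plain_sight(text, detected_names, rng):
    """Replace each detected name with a randomly chosen surrogate.

    Names the detector missed are left as they are; they now appear
    alongside fake names rather than alongside redaction marks.
    """
    for name in detected_names:
        # A function is used as the replacement so the surrogate is
        # inserted literally, with no escape processing.
        text = re.sub(re.escape(name), lambda _m: rng.choice(SURROGATE_NAMES), text)
    return text

note = "Seen by Dr. Mary Jones. Patient John Smith was referred by Ann Brown."
detected = ["Mary Jones", "John Smith"]  # assumed detector output; "Ann Brown" was missed

out = hide_in_plain_sight(note, detected, random.Random(0))
print(out)
```

A reader of the output sees three plausible names and has no marker showing which one is the real leak, which is the property that the human and machine evaluations mentioned above were testing.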
But you try getting this to be certified by an IRB. IRBs do not feel comfortable with this for the most part at the present moment in time when we brought this to them. It doesn’t mean that they won’t do it. It is just that they don’t have guidance on whether or not they can do this.
You have actual names or dates that will leak through. You just have them in a way that they can’t be discerned from other things. There is this question of are you actually putting people in harm’s way at that point. We don’t believe so, but some IRBs have interpreted it that way.
The next question was what is your perspective of oversight for unauthorized re-identification of de-identified data? This is an interesting question that was posed to us. De-identified data has no oversight. This question, I find it to be getting snookered a little bit. But what I would say is that we should actually make this possible. Oversight currently is being made contractual.
You are taking it out of federal, and you are putting it into civil, specifically under these expert determinations. They are often tied to contracts which say you will not attempt to re-identify this data. You will not combine it with resources that could lead to identification. If there is a violation of that, though, this has no relationship to HIPAA. This is all in contract land, which I think is a strange precedent to set.
You could take oversight and make it a government initiative, similar to the way the security rule works in that you have an investigator who goes in to check and see if the appropriate protections are being put in place. This will have to be done at a cost. Somebody has to do those investigations. I do not know if the federal government wants to take this on.
Moreover, I would ask anybody to show me how you would determine if de-identified data has been re-identified. You basically have to find somebody who either discloses the fact that they are re-identifying it, like the academicians. You will hear from Yaniv Erlich tomorrow, who publishes it in Science. Then he says, look, I re-identify.
Or you have to try and figure it out post hoc, where it might have happened behind closed doors within a company that you may never have heard of. These types of protections probably need to be put in place up front. You need to do audits even before the data goes out the door, as opposed to try to discover it afterwards.
What recommendations would you make to help keep policy apace with or ahead of technology? I think there are several opportunities here. Providing for some oversight of de-identified data, I really strongly believe that de-identified data should not get an exemption. The fact that it has zero oversight associated with it is just, yes, I will go on record and say it, I think it is ludicrous.
I think it would be very useful to have a clearinghouse for best practices in de-identification. I don’t think that we should be dictating to organizations what methods they should or should not use. But there should be some type of an agreement upon how you should be using these methods and how they have been used in the past.
I think that we need to come to some type of an agreement for setting what is small risk in this expert determination model. You put 10 experts in a room and ask them, and you get 100 answers. That is unfortunate.
I will stop there. I still want to reiterate the fact that I think de-identification is working really well today. But I think there are opportunities for making it better. If you have any clarifications or questions that we don’t address in the rest of the panel, feel free to contact me afterwards.
MS. KLOSS: The way our agenda is flowing, we will have testimony from Ira and Daniel Barth-Jones. Then we will take a break. Then we have time for open discussion with our whole morning panel. Ira Rubinstein, are you on the phone now? Can we ask that you provide your testimony to us?
MR. RUBINSTEIN: My name is Ira Rubinstein. I teach privacy law at NYU Law School. I recently co-authored a paper titled Anonymization and Risk with Woody Hartzog. I am going to mainly be speaking about how that paper applies to the same set of questions that Brad just reviewed.
I would add to that I am participating in this as something of an outsider. I don’t professionally work in the field of de-identification per se. But in the paper I alluded to, Woody and I conducted a very thorough review of the relevant legal and technical literature around de-identification and anonymization. Then drew some very broad conclusions from that. My perspective is going to be a little bit higher level and a little bit more abstract than what you just heard.
In terms of the main thesis of our article, after reviewing all of this literature, we came to the conclusion that these incidents that have been noted in the literature that are said to demonstrate the failure of anonymization or the ease with which re-identification can be performed has led to a certain stagnation in the debate. There is just a lot of disagreement among experts from different disciplines.
Rather than try to resolve that because we did not have the technical expertise to do so, we argued that the way to move forward was to try to reframe the debate in terms of what we refer to as data release policy, which is a somewhat broader notion than just de-identification alone. The focus of this data release policy should always be on the process of minimizing risk as opposed to just preventing harm where harm is identified as some specific outcome.
What that boils down to really is to think of data release issues much more in terms of data security policy. As you will see from what I am about to say, this would represent a fairly significant shift in the way the privacy rule is written and the way it works. At the same time, I think we would end up in a place that is not necessarily all that radically different in that the conclusions that Brad ended with are conclusions that I would endorse, as well. I think there is room for substantial overlap.
By way of a little more background to our findings in this paper, it is clear to us and I think to any observer that there is a very large literature now around this topic of anonymization and de-identification. Part of it is due to the intense press coverage of the various well-known incidents dating from Governor Weld to AOL to Netflix in which there has been an apparent re-identification of de-identified data.
Here is where we found that the critics and defenders of these two positions, namely one position being as Brad suggested that the de-identification techniques are working quite well, that re-identification attacks are largely unsuccessful in the sense of identifying unique individuals on the one hand. Those who instead take a more abstract or mathematical approach and provide proof of concepts for re-identification, and then claim that this is very easily accomplished.
This debate indicates a very sharp division within the field. Again, I am not in a position to judge this. It is not that I am doubting Brad’s comments. I just can’t evaluate it. But at the same time, I have read the other side of the literature, which is rather vehement at times. I have attended at least some conferences where representatives of both sides are in the room. I was really struck by how little engagement there was between these two different approaches.
As a non-scientist, I guess I had a naïve faith that if you put scientists in a room, they could come to some conclusion. That simply hasn’t been my experience, either from reading the literature or going to some conferences where there was just almost a mutual distrust or a mutual disdain between the two sides in this debate.
That said, though, I think that from our observations in writing this paper, we concluded that there is now a great deal of skepticism and some legal or regulatory uncertainty around whether de-identification is still a successful approach. That may have begun to leak into the popular literature sufficiently to even affect potential participants who are already quite nervous about how well their privacy can be protected. In the worst case scenario, they may then refuse to participate in important studies on the grounds, whether well-founded or not, that their privacy won’t be protected.
Further, again to emphasize this point, there is just this high level of scientific discord between the two groups, which we came to refer to as pragmatists and formalists. I guess I would classify Brad as a pragmatist in the sense that his goal is really to help researchers achieve their scientific goals by providing sound methods for the release of data.
Whereas the formalists are much more interested in mounting very general and abstract attacks, and are less concerned with the empirical side of it and just don’t seem to be interested in empirical arguments at all. I think the net result of this is that we are at a state where it is extremely difficult for policymakers to judge whether de-identification is in need of just some improvement or some radical change or should be abandoned entirely.
With this background in mind, the conclusion that we came to is that there is perhaps just too much focus in this debate on de-identification alone, rather than on the broader spectrum of disclosure limitation techniques and related tools. I think this is due to the fact that if de-identification were to work properly, and this is a much broader theme in privacy law, then you could have this class of data that does not lead to any privacy concern, and hence would not need to be subject to any regulatory oversight.
In a way, that is kind of the silver bullet that policymakers have been accustomed to over the years. It is reflected more broadly in privacy law and the distinction in virtually every privacy law in every country between personally identifying information, or PII, and non-PII. Or in Europe between personal data and non-personal data. The way to get out from under regulatory oversight is to take any data that may be collected or used that is PII or personal data, and change it into non-PII or non-personal data.
I think there is a broad agreement now, at least in the privacy literature, that this quest for perfect anonymization, which would allow a method of compliance based on just this type of transformation and on not being subject to regulatory oversight, is coming to an end as a result of the very broad availability of many different datasets that are either publicly released or readily available. Big data more generally has sparked a change in this approach.
Again, given that background, we began to look more broadly at other methods under the SDL banner, and found that there were, in fact, a variety of methods. Using some articles from the literature, we found that there was a typology based on three broad categories. One, direct access, which mainly relies on licensing and security techniques. Two, dissemination-based access, which incorporates de-identification. Three, query-based access, which incorporates, but is not limited to, differential privacy.
We think that this typology helps clarify several points. One is that agencies should begin to look beyond merely improving de-identification methods and also look into the relevancy of these other methods, as well as legal tools to allow them to make sound decisions about data release. Related to that point, I think is the notion that methods for releasing data to the general public are inherently more risky.
In light of these higher risks, it may be necessary to shift away not only from de-identification as a magic bullet that allows research to proceed with no oversight, but also from a preference for general public release or open data sets. This doesn’t mean that all data has to be behind closed doors and hard to access generally. But on the other hand, the scientific community is by no means as large as the entire internet. There should be some type of compromise that could be reached there, where data dissemination was less broad, even if that is not as convenient as public release. But it would be less broad in order to lower some of the risks associated with it.
We also think that there needs to be more collaboration across these methodological divides. Again, that struck me as very surprising at first, that there seemed to be so little collaboration. Upon further investigation, I learned that these are different disciplines with different practices and different methods and objectives with different measures of success with different professional associations.
In point of fact, researchers in these different fields, my impression again, and Brad or Daniel may in fact correct this impression, but my sense is that they really just don’t come together professionally very much. That should be very broadly encouraged.
Then finally to come back in the rest of my remarks to the privacy rule, my sense is that the privacy rule as currently conceived is just too narrowly focused on de-identification. It needs to be framed more broadly in terms of these other methods and techniques, as well, which could provide alternatives in some cases that work equally well, depending on the scenario.
Again more generally, we would recommend that the topic be reframed away from a quest for perfect anonymization or methods of de-identification that are almost entirely risk-free, and towards a broader process of risk management. In the written remarks that I submitted, I talked a little bit about what this might look like, and made the obvious analogy to the law of data security, which I described as process-based, contextual and tolerant of harm, meaning that just as we recognize that there is no perfect security, and that security breaches will occur, I think we have to begin to make that transition in this context, as well, and understand that there is no perfect anonymization or perfect de-identification. Some harms will occur.
But the key question then is whether data custodians who are releasing data or transferring data have taken the necessary steps to minimize risk and have done so on an ex ante basis, even as they are then subject to some ex-post review. We identified a number of risk factors, which I am not going to go through right now. But again, we suggested that the broader reforms that should be considered have mainly to do with developing reasonableness standards for data release that could be administered not only by HHS, but by other agencies, as well, with oversight. This may not be of interest to this group, but that would include the Federal Trade Commission, as well, in terms of its oversight of commercial organizations who often engage in very similar types of data release in the commercial setting based on de-identification techniques which in many cases have proven to be inadequate.
The four basic steps or aspects of a reasonable data release would be very familiar, I think. They would include assessing the data that is going to be shared and the risk of disclosure, trying to minimize the data to be released, implementing reasonable data control techniques as appropriate, and developing a monitoring, accountability and breach response plan. Again, this certainly necessitates some changes in the privacy rule. It might require that other technological, organizational and contractual mechanisms be addressed, although I think this would still include the de-identification requirements. Rather than experts making a determination in a specific case, it would also be useful if there was some form of certification of the adequacy of underlying processes generally.
I think this would even address some of the concerns that Brad raised about joining databases or different organizations wanting to combine data, if they were all subject to some base-level standard of what types of procedures they needed to follow. This seems to be something that I am beginning to see in the relevant literature. We give an example from the National Institutes of Health’s rules around the release of certain types of genetic data that seem to be going in this direction.
I think this would broaden the privacy rule in response to these concerns over the adequacy of de-identification, rather than remaining within this very narrow perspective. Again, I kind of wholly concur with the recommendations that Brad concluded with. I would point out that the proposal we are making would provide ongoing oversight over de-identified data on the grounds that it would no longer be viewed as data that is beyond the scope or the jurisdiction of regulatory authorities. Except in very rare cases, that I guess I am not able to necessarily articulate. But generally, the assumption would be that there is always some level of risk that de-identified data could be re-identified. Hence, the data custodians that are sharing that data or have responsibility for it have an ongoing responsibility.
The notion of a clearinghouse for best practices I think is also an excellent idea. But I would go even broader there in suggesting that those practices should not be limited solely to de-identification, but should also include methods around what we call direct access and query-based access. Finally, I would endorse Brad’s point about needing to develop more sophisticated understanding of risk and the type of processes necessary to address risk. I will stop there. Thank you.
MS. KLOSS: Thank you very much. Dr. Barth-Jones?
DR. BARTH-JONES: So I will start by thanking the committee for the opportunity to speak today. Good morning, and again thank you for the opportunity to present testimony at this very important hearing. My name is Daniel Barth-Jones. I am an assistant professor of Clinical Epidemiology at the Mailman School of Public Health at Columbia University.
I speak from two key perspectives today. I am an HIV and infectious disease epidemiologist. This brings to my view a critical understanding of the essential role that accurate data sourced from the provision of our health care plays for public health, medical and health systems research.
I have also worked in the area of statistical disclosure control since the early 1990s, more than a full decade before the implementation of the HIPAA Privacy Rule. My work in this area has impressed upon me the importance of using modern statistical disclosure risk assessments and control methods to protect individual privacy and the equal importance of balancing these privacy protections with a statistical and scientific accuracy of de-identified data.
Before I go any further to address any issue of possible conflicts of interest, I note that I have provided consultation to a number of organizations interested in the de-identification of health data since 2002. As everyone in this room knows, there is an important historic and important societal debate that is currently underway.
We have a variety of very important initiatives that are going on with regard to health data. At the same time, concerns about data privacy. There it is. We have a public policy collision course that is going on.
I don’t need to emphasize to this committee the importance of the use of health data. We have things like the Precision Medicine Initiative, the FDA Sentinel Initiative, the Patient-Centered Outcomes Research Institute, PCORI, the increasing trend and push towards release of clinical trial data and, in general, patient safety.
There is a variety of very important efforts that go on using health data. If we can’t continue to do this using de-identified data, we have to consider the possible harms that come from being unable to do that. We need to achieve a place of ethical equipoise, where we have balanced harms on both sides of the equation.
There are two important misconceptions about HIPAA de-identified data. The first is along the lines that it doesn’t work, that it is easy, it is cheap, it is powerful. I think that was probably true pre-HIPAA regulation. Supposedly, one could re-identify 87 percent of the US population using five-digit ZIP code, birthdate and gender. Upon reevaluation, it turned out that was 63 percent. In a 2013 study using those same identifiers, 28 percent of the people were re-identifiable using those three pieces of information.
The post-HIPAA reality is that HIPAA-compliant de-identification provides important, although imperfect, privacy protections. The Safe Harbor re-identification risks were estimated to be four in 10,000 by Latanya Sweeney. Post-HIPAA, the reality is de-identification is working. Re-identification is expensive. It is time consuming to conduct. It requires some substantive computer and mathematical skills. It is rarely successful. When it purports to re-identify, it is usually uncertain as to whether it has actually succeeded.
On the other hand, I think all de-identification experts will admit that it does not work perfectly and it does not work permanently. Perfect de-identification is impossible. It doesn’t free data from all subsequent privacy concerns. There is no 100 percent guarantee that de-identified data is going to remain de-identified regardless of what you do with it.
The Inconvenient Truth, this is my Al Gore moment, is that we can’t have a situation in which there is absolutely no harm on either side of that equation. We have to make a choice. I will point out to you that this is on a log scale here. We have a trade-off between information quality and disclosure protection. We can’t have this ideal situation of having both things be perfect.
What we can do is as we move up this next scale, and so this means we are going from one-tenth of the risk to one-hundredth to one-thousandth to one-ten thousandth. We are making orders of magnitude improvement to get as close as we can to that ideal spot in that tradeoff.
Unfortunately, some popular de-identification methods can degrade the accuracy of de-identified data for multivariate statistical analyses. They can distort variance-covariance matrices. They can mask heterogeneous subgroups. This problem is well understood by statisticians, but it is not as well recognized and integrated into public policy.
But poorly conducted de-identification can lead us into the area of bad science and bad decisions when we have distorted the true relationships and don’t appreciate that we have done that. I would suggest to you that perhaps that is the greater societal harm that we face if we don’t get this right.
Here is an example. I hate to pick on the authors of a particular paper, but here are the results: the percent of regression coefficients that change significance under a variety of anonymization methods. Just take a look at the scale here. On the high end, we are looking at almost 60 percent, 80 percent, 90 percent, 75 percent. We are using three different types of regression methods, two different types of cancer datasets. I would conclude if this is what we are going to do to our ability to conduct accurate research by de-identifying, we should all just go home if that is the extent of the damage we are going to cause.
Fortunately, I don’t think that is how it has to turn out. We can do a better job than this. But poorly-conducted de-identification can distort our ability to learn what is true, leading to bad science and bad decisions. This doesn’t have to be an inevitable outcome, but we need to take into account the re-identification risk context and examine the distortion and the statistical accuracy and the utility of de-identified data. Doing this requires a firm understanding and grounding in the extensive body of statistical disclosure control and limitation.
Data privacy concerns are far too important and complex to be summed up with catch phrases or anecdata. Anecdata is collections of data, small observations that you have brought together and repackaged as if it is some kind of meaningful representative data. The kind of eye-catching headlines and Twitter buzz that we get saying that there is no such thing as anonymous data serves some purpose in that it draws people’s attention to the broader and important concerns about data privacy in this era of big data.
But such statements are essentially meaningless, even misleading, if they don’t consider the specific re-identification context, the data details, things like the number of variables, the resolution of their coding schema, special data properties, such as geographic detail or network properties.
The de-identification methods that are applied also need to be considered, and the experimental design of the re-identification attacks need to be critically evaluated. Good public policy demands reliable scientific evidence.
Unfortunately, de-identification public policy has been driven largely by anecdotal and limited evidence, which has been targeted to particularly vulnerable individuals and has failed to provide reliable evidence about real-world re-identification risks. The added problem associated with that is that when a re-identification risk has been brought to life through an attack, our ability to assess the probability of it in the real world may subconsciously become 100 percent, which is highly distortive of the real risk benefit calculus that we face. I would recommend to you the writings of Cass Sunstein on the issue of the precautionary principle and trying to prevent harm on one side of an equation. We also have to take into account potential harms on the other side.
Now, I know you can’t possibly parse this entire set of information on this slide, but I will walk you through some of the key points. You have the slide itself and an extensive set of references that are numbered in the slide. These are some of the more famous re-identification attacks that have occurred since Governor Weld up until last year or two. You will see that almost all of them include quasi-identifiers that would have been excluded by HIPAA. They are marked in red there.
The only exception to that is the Safe Harbor study that was already mentioned by two of our speakers. The press did not cover that story, of course. It is not exciting. Two people re-identified out of 15,000. It didn’t really draw the public’s attention. However, if we look at the headlines for some of these other re-identification attempts, anonymized data really isn’t successfully re-identified.
Ninety-nine percent of the people in the Netflix database, that is in a Supreme Court Amicus brief, is where that came from. How many people were re-identified? Two. There was a hypothetical about how you could re-identify 99 percent of the population, but that was a hypothetical and required you to have certain information that they didn’t verify you could have.
If we look at the sum of the number of individuals who are re-identified across all of these recent re-identification attempts, we have got fewer than 300 individuals. If we look at the sum of the number of people who could have been re-identified and the datasets that someone tried to attack, they actually tried it, we have got over 300 million people.
Is our risk one in a million? No. Have we done good science that allows us to figure out what the risk is? No. We haven’t done that, either. Most of these were against data that wasn’t protected either by HIPAA or modern statistical disclosure limitation methods.
The demonstrated risks are extremely small. Where they are not small, it is against things like the Personal Genome Project using ZIP-five, gender and date of birth. They got 28 percent. If you look at the same Safe Harbor study, which also threw in marital status and Hispanic ethnicity, .00013. If you learn about de-identification and its failure through the news, you are not looking at the right sources.
That said, I think there are six ways in which re-identification science has failed to support public policy as fully as it could. It has attacked only trivially de-identified, straw man data. It is de-identified in the sense that the names are removed from the individuals, trivially anonymized. In one of the studies for the Personal Genome Project, the names weren’t even removed. They actually were embedded in the file names and were originally counted in the news reporting as if you could be re-identified from that. If your name is already in it, you are not re-identified. You are merely identified.
They target only especially vulnerable subpopulations in many cases. They often make bad worst-case assumptions across the board. A corollary of that is that the experiments have often not been designed to show the boundaries of where de-identification finally succeeds. The reason why that is important is that if you look at this, a lot of the authors involved have cried de-identification doesn’t work.
What is important to know is that things like the cell phone data with high resolution of time and days and GPS-level locations, HIPAA would say that is not de-identified data under either of the methods. It would fail Safe Harbor. If an expert looked at it, he would say, that is not de-identified. That can’t be released. It is important to take it to the point where de-identification tells us this now has low risks of re-identification.
Some of the additional failures of re-identification science are that it has often failed to distinguish between sample uniqueness and population uniqueness, and the larger issue of if you are unique in the population, can you actually be re-identified, meaning can you link the population-unique observations to their identities. It has also failed to specify the relevant threat models. Now, this has been the case in a number of scenarios where they are kind of very loosey-goosey about exactly what would need to be done. But it has implied that it could be done.
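The distinction between sample uniqueness and population uniqueness can be shown with a minimal sketch. The toy population, the released sample and the quasi-identifier fields below are hypothetical and are not drawn from any study cited in the testimony.

```python
from collections import Counter

# Hypothetical toy population described by quasi-identifiers:
# (3-digit ZIP, birth year, sex). These values are invented for illustration.
population = [
    ("100", 1970, "F"), ("100", 1970, "F"), ("100", 1970, "F"),
    ("100", 1982, "M"), ("100", 1982, "M"),
    ("104", 1955, "M"),
    ("104", 1990, "F"), ("104", 1990, "F"),
]

# Hypothetical released sample: a subset of the population.
sample = [population[0], population[3], population[5], population[6]]


def unique_keys(records):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(records)
    return {key for key, n in counts.items() if n == 1}


sample_unique = unique_keys(sample)
population_unique = unique_keys(population)

# Only records unique in BOTH the sample and the population are candidates
# for linkage, and even those still have to be tied to an identity.
truly_unique = sample_unique & population_unique

print(len(sample_unique), "records are unique in the sample")
print(len(truly_unique), "of them are also unique in the population")
```

In this toy case every sampled record looks unique within the sample, yet only one is unique in the underlying population, which is why counting sample uniques alone overstates risk.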
If you don’t have a full threat model, you can’t really evaluate the risk of re-identification. There is an unrealistic emphasis on absolute privacy guarantees as there is, for example, in differential privacy. We have to recognize that there are some unavoidable tradeoffs between data privacy and statistical accuracy, and both have important implications with regard to harm.
What can we do to better inform public policy and practice? We can demonstrate re-identification risk where modern statistical disclosure control has actually been used. We can use statistical random samples and scientific study designs to provide representative risk estimates. We can use ethically-designed re-identification experiments to better characterize the re-identification risks for quasi-identifiers beyond simple demographics. Obviously, people can be re-identified on occasion through diagnoses, through procedures. A gentleman had a penile transplant. That was the very first person to do that. If that hasn’t been accounted for on a dataset, he probably would be able to be recognized. Re-identifications may indeed rarely occur. We need ethical study designs to be able to study how frequently that might occur, so that we can weigh these risks and figure out how to address them.
We need to design our experiments to show the boundaries where de-identification finally succeeds and to provide evidence to justify any assumptions that are made about what a data intruder knows. We need to verify re-identifications. This is important because we need false positive rates in order to say how often would this re-identification actually result in the individual that has been alleged to be re-identified.
We need to investigate multiple realistic and relevant threats, and specify those threat models. Then finally, we can use modern probabilistic uncertainty analyses, which create a distribution of our uncertainty and select from that, so that we look over a great range of what we are uncertain about. We will always be uncertain about data intruder knowledge. But it doesn’t mean that we are unable to probabilistically analyze it using modern uncertainty methods.
We also need to have additional controls, as Ira indicated. We need a broader sense of process controls. I think these things should routinely be required. I think it would make sense to have data use agreements just like the limited dataset has a required data use agreement. I think it makes sense to have a required data use agreement for de-identified data. This should require that individuals not attempt to re-identify or allow to be re-identified any patients or individuals in the data, or their relatives, family or household members. It should also require them not to link any other data elements to the data without obtaining a determination that it remains de-identified.
The problem with de-identified data is de-identified data combined with de-identified data does not of necessity produce de-identified data because it is the entire set of quasi-identifiers in combination that create the re-identification risk. You can take two de-identified datasets, put them together and increase the re-identification risk above what was present in either one individually.
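A minimal sketch of this point follows. It assumes two hypothetical releases about the same four people that share an arbitrary record key (here called `rid`), and it uses the size of the smallest group of records with identical quasi-identifiers as a simple stand-in for re-identification risk. All values are invented.

```python
from collections import Counter


def smallest_group(rows):
    """Size of the smallest set of records sharing identical quasi-identifiers."""
    return min(Counter(rows).values())


# Hypothetical release A: age band and sex, keyed by an arbitrary record key.
release_a = {
    1: ("40-49", "F"),
    2: ("40-49", "F"),
    3: ("50-59", "M"),
    4: ("50-59", "M"),
}

# Hypothetical release B: 3-digit ZIP only, keyed by the same record key.
release_b = {
    1: ("100",),
    2: ("104",),
    3: ("100",),
    4: ("104",),
}

# Linking the two releases combines their quasi-identifiers per record.
linked = [release_a[rid] + release_b[rid] for rid in release_a]

k_a = smallest_group(list(release_a.values()))
k_b = smallest_group(list(release_b.values()))
k_linked = smallest_group(linked)

print("smallest group in A:", k_a)
print("smallest group in B:", k_b)
print("smallest group after linking:", k_linked)
```

Each release on its own leaves every record hidden among at least one other, but after linking every record is unique on the combined quasi-identifiers.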
We need to implement appropriate data security and privacy policies, procedures and associated administrative, technical and physical safeguards. This allows us to have a better idea of who the anticipated recipient is. If data is going to be released freely on the internet, that really needs to be conceived as a different situation than if it is under controls that allow us to know who the anticipated recipient is. If it is wide open on the internet, then I think we have a different tolerance for higher re-identification risks. Data in the hands of trusted researchers is a lot different than data that anyone in the world could get to.
We need to assure that all personnel and parties with access to the data agree to abide by these conditions. We need chain of trust contracting that keeps these conditions in place once the data is de-identified. If it passes on to a third party, that individual is bound by the fact that the first person who got it signed an agreement that they would impose these conditions on anyone they give it to. We create a chain of trust.
What would be even better than that, though, is comprehensive, multi-sector legislative prohibitions against re-identification. Congress hopefully is watching. I am sure they are. What we need is some broad prohibition against re-identification. It needs to allow for re-identification research. That should be allowed to continue under ethical oversight from IRBs. Those kind of exceptions need to be built in. But if we want societal benefit from de-identified data, we need to impose consequences on the individuals who re-identify it without IRB approval.
We also need centers of excellence in which we can have combined graduate training in statistical disclosure control and privacy computer science. As Ira talked about that huge split, I guess you have probably figured out that I am part of the pragmatist rather than the formalist camp. Perhaps if we created environments in which parties from both sides actually sat down and had lunch together, that may be a situation in which we get a more productive dialogue there.
The more important thing is this is graduate-level material. People are not going to learn this with a one and a half or two-day course on how to do de-identification. Just getting through this list of books here is going to take you a while. We need graduate level centers of excellence, and we need support for those.
One thing just quickly before I wrap up because I know Yaniv Erlich will be speaking tomorrow, one thing I wanted to point out about the Y-STR attack that he will be talking about. There is a real question with that attack whether it is economically viable. He was able to re-identify five people and then, through associating those individuals with genealogies, identify 50 people.
In his broader simulation, you will see if we start with 100,000 individuals in the general population, half of those people are not directly subject to attack because they are females. They don’t have a Y chromosome. Of the males that are remaining, his methods showed that they couldn’t infer a last name at all for 83 percent of the individuals. In the 17 percent in which they could guess a last name, it was incorrect in 29 percent of those cases.
If you were going to try and use this to rule people out of insurance, you have tried to re-identify them. You now have to throw out almost a third of your individuals who are good-paying customers and say, we are going to eliminate your insurance because we think you are a bad bet.
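The arithmetic behind these figures can be worked through directly. This sketch uses only the percentages stated above (half the population male, a surname guess for 17 percent of males, 29 percent of those guesses wrong); the variable names are illustrative.

```python
# Figures as stated in the testimony; integer percentages avoid rounding noise.
population = 100_000
pct_male = 50            # half have a Y chromosome
pct_surname_guessed = 17  # a surname could be inferred at all
pct_guess_wrong = 29      # of those guesses, share that were incorrect

males = population * pct_male // 100
guessed = males * pct_surname_guessed // 100
wrong = guessed * pct_guess_wrong // 100
correct = guessed - wrong

print("males exposed to the attack:", males)
print("surname guesses made:", guessed)
print("incorrect guesses:", wrong)
print("correct guesses:", correct)
print("correct guesses as a share of the population: "
      f"{100 * correct / population:.1f}%")
```

On these numbers a surname is correctly inferred for roughly 6 percent of the starting population, and close to a third of the guesses an attacker would act on are wrong. A correct surname is still only one step toward identifying a specific person.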
I think we need these kinds of economic evaluations of re-identification attacks to see would somebody actually use this to create harm. It is not enough that a scientist can do it in a few cases. It is the economic viability of these attacks that I think we really need to pay attention to. Thank you very much. I appreciate your time.
MS. KLOSS: Thank you very much. Our goal is to take a 15-minute break, and then come back. Really, we have a luxurious amount of time for a discussion. We will come back at 11:25. Then we will have until 12:45 for discussion.
MS. KLOSS: It is about 11:25, so we want to make use of every minute with our great panelists. Ira has some limitation on his time with us this morning. We are going to ask that we direct questions to him over this next 20-minute period. Just keep that in mind as we let the questions move forward. We won’t limit questions to Ira. We will keep going. But I am going to make sure that we have rounded up our questions for him over the next 20 minutes. We will start with Helga.
DR. RIPPEN: I guess I just wanted to pose a question that may not be answerable per se, but just something to reflect on. Everyone is talking about re-identification as kind of the risk and the area of focus. But even if someone hasn’t been really identified as the individual with all the attributes, what about if they are actually linked and it is wrong.
When we talk about risk, one may need to consider other risks, especially if harm can occur as a result of falsely identifying someone. The other thing is we talked a little bit about risks of 2 in 15,000. I just want to highlight something that is pounded into us: six sigma, which is a risk of 3.4 per million, for patient safety. I guess I am just putting out that number as far as what is an acceptable risk or not. That is all.
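To put the two figures mentioned here on the same scale, a short calculation converts 2 in 15,000 to a rate per million and compares it with the six sigma figure of 3.4 per million. This is only unit conversion and takes no position on what level of risk is acceptable.

```python
# Figures as stated in the hearing.
safe_harbor_hits = 2
safe_harbor_records = 15_000
six_sigma_per_million = 3.4

# Express the Safe Harbor study result as a rate per million records.
safe_harbor_per_million = safe_harbor_hits / safe_harbor_records * 1_000_000

# How many times larger the observed rate is than the six sigma benchmark.
ratio = safe_harbor_per_million / six_sigma_per_million

print(f"2 in 15,000 is about {safe_harbor_per_million:.0f} per million")
print(f"that is about {ratio:.0f} times the six sigma rate")
```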
DR. BARTH-JONES: I would like to respond to the two comments that you made. The first point, there is no current definition of what a re-identification is. That would be a really useful thing to get established in guidance. I would propose that it should involve linking to any of the 16 identifiers that are in the limited dataset exclusion list because those are one step away identifiers.
The only thing that I think might be a useful exception there is for medical devices. I think there are some circumstances under which an exception might be warranted there. If we did that, then the process of even creating an incorrect re-identification, which I think you are right to point out even if it is not a correct identification, it may cause harm. That would still be something that could be penalized. Linking it back to those direct identifiers is perhaps a useful definition there.
The second point is I agree that we have a question about what is an acceptable risk. If you are the two people out of 15,000 that could be re-identified, you might not feel that is acceptable. However, as a society, we also have to look at the idea of what is on the other side of this. If we go for a zero risk tolerance, it is pretty clear we can’t do most of what we need to do to protect people’s health and to create a working health system.
DR. MALIN: Let’s start with the second question first. This notion of what is an acceptable level of risk. I don’t think that we are going to get to that answer today. What I will say, and I will turn to the epidemiologist at the table who may know of this better than I, is that this notion of risk with respect to potential identification is something that has been done way before HIPAA, the disclosure of public health outcomes and public health-related data.
You look across what is going on in the United States for the last 30, 40 years. There has been always this discussion of the disclosure of small cell counts and multi-dimensional contingency tables. You see questions or thresholds of anywhere from a count of three to a count of like 25 that is considered to be too small to disclose.
Those are quite small when you think about it. It leads to like risks that are way higher than the number that we are alluding to. Those were decisions that were made by public health officials as the statistical agencies within the states. Whether or not those are the correct numbers to use, I don’t know. But if there is precedent for doing something like that, it might be the right thing to do to try and create some type of a standard, such that even the states aren’t doing something in a way that is necessarily contradictory to what the federal level of protection would deem to be acceptable. Again, I am not going to tell you what the exact number is.
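The small-cell thresholds described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any agency's actual rule: the function names and the example table are hypothetical, cells with a count below the threshold are withheld while zero counts are left as published, and the worst-case risk assumes each person in a published cell of size n is equally likely to be the target (1/n).

```python
def suppress_small_cells(table, threshold):
    """Return a copy of the table with small nonzero counts withheld.

    Assumption: a cell is withheld (set to None) when 0 < count < threshold.
    """
    return {
        cell: (None if 0 < count < threshold else count)
        for cell, count in table.items()
    }


def worst_case_risk(threshold):
    """Largest 1/n risk among published nonzero cells under the 1/n assumption."""
    return 1 / threshold


# Hypothetical contingency table: (region, age band) -> number of cases.
example = {
    ("Region A", "0-17"): 2,
    ("Region A", "18-64"): 40,
    ("Region B", "0-17"): 0,
    ("Region B", "18-64"): 11,
}

published = suppress_small_cells(example, threshold=3)
print(published)
print(worst_case_risk(3), worst_case_risk(25), 2 / 15_000)
```

Under these assumptions, a threshold of 3 permits a worst-case risk of about 0.33 and a threshold of 25 permits 0.04, both far above 2 in 15,000 (about 0.00013), which is the comparison being drawn in the testimony.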
Your other question is actually quite intriguing. This notion of what do you do with a wrongful re-identification. This is really just about I think you have to think about the implications. Identification in its own right may not necessarily be a problem. It depends on what you do with that information post hoc.
There is a potential for multiple types of problems or harms in the system. One would be against the individual to whom the information corresponded. If the institution or the individual who actually committed the identification identifies the wrong person, they actually harm themselves as well because the decisions that they are trying to make are actually based on wrong information.
For instance, imagine you are trying to differentially price somebody out of the system. If you end up pricing out the wrong person, you actually induce more hazards for yourself as an organization because you have left the person who would be a problem in the system. It is a double-edged sword in that regard.
DR. BARTH-JONES: If re-identification were prohibited, one of the things that would be really useful is that false re-identifications are probably going to be much more easy to detect. If we enable whistleblowers in that situation, the people who are correctly re-identified are probably less likely to come forward.
People who are incorrectly re-identified are much more likely to be good sources of signal detection for re-identification behavior. I think it is important that we define our definition of what constitutes re-identification to maximize our ability to detect it. That is one of the reasons why I think the mere linkage, even if it is incorrect, re-identification ought to be part of the prohibited action.
MR. RUBINSTEIN: On the question of false re-identification, I think that is indeed a very interesting issue. I would just point out that the courts have begun to examine this in a somewhat different setting under the Fair Credit Reporting Act of inaccurate information. At least from a legal standpoint, there are significant questions around what sorts of harms have to be associated with such publication or sharing of false information.
Where someone’s attributes are improperly identified by a data broker, does that harm them? That is a very live issue at the moment. I just wanted to make that clear that from a legal standpoint, the question of the harm associated with false re-identification would be key. There is not that much clarity on what would constitute the necessary type of harm for it to be actionable in the court.
On the other point regarding what is acceptable risk, I tend to agree that remains very unclear from a conceptual standpoint. Zero risk is not something we should be striving for. But at the same time, that may necessitate some changes both in the regulatory structure and in the rhetoric around it.
As long as the rule remains that using either of the two methods of de-identification approved of in the privacy rule takes that information outside of the category of regulated health information, then it seems that only if there is zero risk is that appropriate. To say that more simply, there is something incompatible about having some level of risk, but then treating that data as unregulated. The solution is not going to be to continue to insist on the zero risk because I think nobody is claiming that is feasible. The solution may have to come from the regulatory side. That means the regulatory burdens would be minimized when proper de-identification methods are followed. But they wouldn’t be entirely eliminated. You would no longer have this category, a Safe Harbor that is entirely safe. It would still carry some burdens.
MS. KLOSS: Thank you. Those are important additional comments. I am going to go to Nick.
MR. COUSSOULE: I guess I wanted to explore a little bit the risk and the harm question. If we never made data available, then you could argue that there is no harm. You are just not creating opportunities. How do you balance off the risk of creating some likelihood or some potential for data being exposed with the value that is potentially generated, which oftentimes is unknown.
The reason I go down that path is trying to understand what the policy implications are and how do you set policy to recognize the risk versus the value to be potentially generated out of that?
DR. BARTH-JONES: I would say two things, Flint, Michigan and Zika virus. The idea that if we didn’t release data, there would be no harm, I think really needs pushback in that. In Flint, Michigan, the crisis with the lead in the water was detected by a doctor who looked at data that was in the electronic medical records. He was able to see that this was a widespread phenomenon. I think the US public actually wants those sorts of things to be done with the information that we have available. They don’t want to personally be re-identified. But I think they want those kind of public health revelations to occur.
If you consider Zika virus, if we were in a situation where Zika was not yet known, and therefore could not be subject to public health reporting, we had no other mechanism, we would be stuck with what epidemiologists call syndromic surveillance. We would need to be looking for children who have microcephaly. I am pretty sure the US public would also come down on that side, too. We as a society have a need to be able to at least count what is going on with regard to the health of a population as a whole.
Where is that information? It is inside our health care system. We have to design a system that protects privacy and yet allows us to do that kind of counting. To figure out the cause of a disease, because we have confounders, effect modifiers, things that in statistical analyses have to be accounted for that can confuse what our knowledge of the cause is, we need individual-level data in order to be able to sort that out. I would push back on the idea that there is no harm if you don’t release; that is actually not true. I would say the bigger harm is in across the board deciding that we can’t release because there is some small privacy harm.
DR. MALIN: One of the greatest challenges with responding to your question is that we treat the health care sector as if it is all one sector. We have many different types of information that are being generated within the health domain.
They bring about different use cases associated with them, such that they emanate out into other types of environments. We talk about insurability. We talk about employment. We talk about stigmatization with respect to certain types of disorders that have just public types of effects.
This notion of what exactly are the harms, they are innumerable. It is a really challenging problem because I don’t know how to answer this question for you at the present moment in time. It may actually be worth trying to put together some type of documentation of what are we worried about with respect to the use of data, specifically certain types of health data.
I think you are going to see different types of problems with respect to general clinical data versus psychiatric data versus genetic data and potential pre-determinism and all of these things. They all have even some of their own laws associated with them, which I will go on record again and say I do not believe in exceptionalistic law. I don’t agree with, for instance, things like GINA. But I think that until you actually know what that bogeyman is that is out there, it makes it very difficult to determine what exactly it is you are supposed to do to protect the information.
MR. RUBINSTEIN: I would add a general observation about the question of releasing data. It is certainly the case that if you never make it available, then you have eliminated the harm. There are a lot of analogies to that throughout the world of data security. If you set your spam filters to filter all email, then you will reject all email. You never get spam, but you never have any communication with anyone. You have much the same problem here. You never get research value if you prevent the data from becoming available.
One of the interesting observations in reviewing the literature is that there is a tendency to view this from an all or nothing standpoint, and a general preference for public availability, for public release, which just may no longer be appropriate given the advances in techniques for linking dispersed datasets and the wider availability on an anticipated basis of new datasets being made available.
I think a more nuanced approach to what data release means is also required. If you could have 100 percent confidence that data was de-identified, this wouldn’t be an issue. But that is not the case. Or if you could always generalize data so much that it wouldn’t present risk of individual harm, this wouldn’t be a problem. But then you wouldn’t get much research value.
I think the other variable is what kinds of barriers to access might be required. In the cases that Daniel just alluded to of Flint and Zika, I am not suggesting that the data that members of the relevant professions would need to begin thinking about these issues from a public health standpoint be locked up in some way that makes it extremely burdensome or inconvenient to gain access to it.
But at the same time, I don’t need access to it. Most people with internet connections don’t need access to it. The relevant community of scientists and professionals is large, but it is not immense. It is not equal to the size of the internet. Another variable to think about is how to share data broadly enough to allow valuable scientific research to occur, but without necessarily making it publicly available in the sense that it now persists forever and allows linkage with arbitrary future datasets that are released.
MS. KLOSS: Thank you. Vickie, you are next.
DR. MAYS: I guess I am still struggling with the risk question, but I am going to put that aside. I want to ask Professor Barth-Jones a specific question. On one of your slides about recommended de-identified data use requirements, one of the things you talk about is not link any other data elements to the data without obtaining certification that the data remains de-identified.
When I think about that that would just almost paralyze me as a researcher, if I understand it correctly. I am geocoding data. I am doing a lot of things that potentially may end up with a risk, but I don’t know that. If every time I wanted to do that, I had to go somewhere for certification, I am not sure what I would do.
DR. BARTH-JONES: The unfortunate reality, though, is if you had one dataset that had a high geographic resolution, and was considered de-identified because it had no demographic characteristics, and another dataset that had a lot of demographic quasi-identifiers, but no geography, and you had some way of linking those together, the new data set that you created would no longer have very small re-identification risks.
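The linkage effect described here can be illustrated with a small sketch in Python. Everything in it is hypothetical: the records, the field names, the shared `id` linking key, and the helper `smallest_group`, which reports the size of the smallest group of records sharing the same quasi-identifier values (a group of one means a record is unique on those values).

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Dataset 1: fine geography, no demographics (made-up values).
geo = [
    {"id": 1, "area": "Tract-A"},
    {"id": 2, "area": "Tract-A"},
    {"id": 3, "area": "Tract-B"},
    {"id": 4, "area": "Tract-B"},
]

# Dataset 2: demographics, no geography (made-up values).
demo = [
    {"id": 1, "birth_year": 1980, "sex": "F"},
    {"id": 2, "birth_year": 1975, "sex": "M"},
    {"id": 3, "birth_year": 1980, "sex": "F"},
    {"id": 4, "birth_year": 1975, "sex": "M"},
]

# Individual-level linkage on the assumed shared key.
demo_by_id = {r["id"]: r for r in demo}
linked = [{**g, **demo_by_id[g["id"]]} for g in geo]

print(smallest_group(geo, ["area"]))                           # each dataset alone
print(smallest_group(demo, ["birth_year", "sex"]))
print(smallest_group(linked, ["area", "birth_year", "sex"]))   # after linking
```

In this toy example every record shares its values with one other record in each dataset taken alone, but every record becomes unique once the two are joined at the individual level.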
The idea that once data is de-identified, it will permanently remain de-identified is a myth. There needs to be more expertise, so that there are more experts that can help researchers with this kind of evaluation. That is part of the reason my recommendation is we need a lot more training in people who know how to do de-identification and do it well. It is unfortunate, but I think to protect against being able to combine de-identified datasets without any further concern as to whether they remain de-identified, I think there is an important risk that is there.
DR. MAYS: I want to take two use cases. Say, for example, I do it as a researcher, I have an IRB who is guiding me at every step as to what to do, how to do it and what my responsibilities are. But if I am a data entrepreneur who knows that putting those two datasets together, I am now going to have a product to sell. There is a very different set of guidance.
We are kind of getting back to your comment to some extent. Different data has different levels of risk. They have different outcomes. I am really trying to figure out how I would do what you are suggesting, which is to get some kind of certification. Like from who, how would we do it?
DR. BARTH-JONES: One thing that is important to note is that HIPAA does not require that it be an external expert. It is someone who knows the scientific and statistical principles. You could do it yourself if you had sufficient background to do that training.
If you are doing things like adding in, you have got the state of Michigan present. Well, characteristics of the state of Michigan can be associated with that. It is individual-level linking that my recommendation applied to.
MS. KLOSS: Ira, anything to add to that?
MR. RUBINSTEIN: There is a trend in the privacy literature towards responding to these types of concerns with a new emphasis on regulating uses. I think that is partly what Vickie is alluding to. I think it might be a function of both how the data was originally shared and made available to her, if it was under a set of restrictions and subject to a data use agreement.
Now, even in internal certification along the lines that was suggested. There were no plans to share the data more broadly or to use it. I mean, it would be for different research, but it would still be under the broad category of health research or other publicly-valued research as opposed to commercial use.
Then those might be relevant factors that could be set out in a regulation. You would have a matrix that looks at all of those factors and again tries to move away from an all or nothing analysis. That might be the way to make it a little more manageable.
DR. SUAREZ: I have a few topics that I wanted to explore with the group. I will start with two topics. I noticed a few slides talked about it. What is the degree to which these two examples of the type of data are going to increase challenges related to de-identification because of the nature of the data and the availability of that data in some cases.
One of them is genetic information. We have more genetic and genomic information about individuals. One would think having a genomic profile without any connection to anything would be meaningless. But in some cases, there could be possibilities of linking that based on the profile itself. That is precisely the kind of thing that is done in legal cases and other things.
Genetic information seems to create an increased level of potential risk perhaps. I wanted to explore that with you.
The other type of data is social media data and how much the availability of social media data that is available across the board can actually present an increased risk for re-identification of otherwise soon-to-be de-identified data. That is the first one. I will go into the other topics if we have time.
DR. MALIN: I really don’t know if I want to touch this with a 60-foot pole, to be honest with you. You heard my comment on genetic data earlier. I think let’s start with that one.
There are two ways you have to be concerned about genetic information. The first one is its uniqueness. The extent to which genetic information in its own right serves as the key is dependent upon your ability to link it to some other resource that has an identifier associated with it.
We published papers on this almost 20 years ago. If you have the same individual showing up at multiple institutions, they participate in multiple biorepositories. Each of those institutions will know the respective identity of the individual. They may have only a snapshot of the individual’s clinical persona or their history. Yes, that would definitely lead into a situation in which the genomic data could serve as an identification tool.
Which is why I don’t think that you give genetic information passage under something like Safe Harbor. I think you have to understand the anticipated recipient at that point. Let’s flip that over to the second side because I think that first one says know where you are giving the genomic data to. I think if you take genomic data and just throw it up on the web, and expect it to remain protected in perpetuity, I worry about that one.
The other side of it is the inferential disclosure component, where you are seeing that genomic data has the potential to intimate certain aspects of an individual, like their race or ethnicity, obviously gender. You are certainly going to have certain types of clinical factors.
For instance, you are clearly going to see like Klinefelter or other types of trisomy or even quadsomy disorders. You may end up getting statistical averages with respect to their height and propensity towards having obesity. For the most part, at the present moment in time, it is still very far-fetched to say that you would be able to discern an individual’s identity directly from their DNA.
Really, what you end up doing is you end up trying to predict a lot of the same factors that are getting taken into account in risk assessments anyway. It is one of those situations where it has the potential to be translated into something that you may use for identification purposes at that point. I still think it can be taken into account in a risk assessment.
Your second question about social media, we could probably spend a year discussing. I think one of the biggest challenges you have with social media. A, you have self-disclosures. If an individual chooses to self-disclose, there is nothing that you can do about that. They have to know that they are taking their rights in their own hands at that point. Whether or not you have to take that back to social media companies and say that is something you have to put in your disclaimers when using these types of resources, I don’t know.
The other concern I have with social media, and we actually just presented a paper on this three days ago, is that we are finding that social media is enabling many people to disclose the health information of other individuals, which that is really quite interesting because there are some concerns about consent there.
What we are finding, for instance, the title of the paper that we just published was called #PrayForDad. What you are seeing is that people go into surgery or they have some type of a debilitating disorder, like Parkinson’s or Downs, which obviously is an observable physical phenomena. But they are using social media as like a coping environment. They just automatically throw this stuff up.
There is this disclosure of like my dad is in surgery right now. Please pray for him. Okay, so now you know the time. You know the date. This is a concern. But this is again people taking their lives in their own hands at that point, which I think is a different situation than saying can we protect information when we just put it out there under the expectation that people have an expectation of privacy. If they promote themselves out there, I think they are relinquishing their right to an expectation of privacy. I think we have to distinguish between those two.
DR. GARFINKEL: I was going to say that in our report, we say that another issue with genetic information is the disclosure of heredity information. It may be that the person is not in the dataset, but their family members are in the dataset. They can be found that way. We also need to remember that not all individuals have a unique genome. There is the possibility of misidentification.
DR. BARTH-JONES: The last slide, I don’t know if it got included. I actually discussed this idea that although de-identification of genomic data is admittedly non-trivial, there is a high intrinsic uniqueness as Brad indicated. Individual consent is not going to solve our ethical and privacy challenges here. Disclosing my situation, discloses for all of my relatives past, present and future. We don’t have a neat, clean solution to this.
DR. MALIN: There are other groups that are wrestling with this. I will let Ira jump in.
MR. RUBINSTEIN: I just wanted to add the perspective that the National Institutes of Health issued genomic data-sharing policy recommendations in August 2014. From my perspective, I found them quite interesting because it was clearly a case of not abandoning de-identification given some of the concerns just raised, but supplementing it with controlled access methods, with codes of conduct, with data use agreements, with new security requirements, with new consent protocols seeking consent for broader subsequent sharing.
I think that as a general model, that is very desirable. It is kind of a belt and suspenders approach. But I think it is appropriate here because there is sufficient concern around whether these issues of uniqueness or inference are such that de-identification alone won’t be sufficiently protective. But again, rather than conclude that de-identification serves no purpose, NIH sought to bolster it with these other methods and controls, including some legal controls. At least from my perspective, that seems like a very good model to consider going forward.
DR. MALIN: Could we follow up on the NIH policy for a second? I actually have concerns with going with the direction that the NIH policy has gone, their genomic data-sharing policy. The biggest one is that they actually walked away from specific consent in it, and they went towards a broad consent model. In many senses, it is a little bit of a fleecing. Basically the data, you raised a concern before, or somebody raised a concern before, about do you bias the population when you have to get their consent.
What happens now is that the NIH is requiring all of us to get broad consent to have information deposited into the database of genotypes and phenotypes, which is mainly what is being overseen by the GDS policy. This has actually taken this notion of fine grain consent off the table for individuals to have their information reused at a later point in time.
I think it should go one way or the other. Either you take consent off the table, or you put fine grain consent back on the table. This notion of broad consent, and that is your only option, I think is a very dangerous precedent to set. I probably just shot myself in the foot for getting future NIH funding.
DR. PHILLIPS: I want to come to one of the points that Dr. Rubinstein raised. That was to move beyond de-identification and really incorporate the full gamut of statistical disclosure limitation into the rule. I didn’t hear that from anyone else. I just wondered if others had comments about that. It is actually in the guidance document that was put out by OCR.
DR. MALIN: There are a bunch of examples of techniques that could be used. Even the original privacy rule alluded to the Federal Committee on Statistical Methodology’s Statistical Policy Working Paper 22, which has been around for decades. This notion of statistical disclosure limitation techniques has always been available. It is just a question of when do you apply it.
I think that there have just been different perspectives. Most of the statistical disclosure limitation methods have been focused on aggregate statistics. There has been some work done with like swapping and perturbation of individual-level records. But a lot of it has been focused on publishing models or multi-dimensional contingency tables, which is aggregate statistics. A lot of the informatics and the computer science community has gone towards this direction of if we are going to take individual-level records and put them out there, how do we do that?
There has been a little bit of a disconnect and a little bit of jockeying for position of who has the right methodologies when. But they are all available. They have all been promoted.
DR. BARTH-JONES: I would add that those recommendations that I made about data use requirements, prohibiting the re-identification, not linking, maintaining appropriate security controls, and establishing a chain of trust. Those are, I think, part of a broader statistical disclosure limitation vision. I would like to see those not just be contractual, as they usually are right now, but have some force of regulation behind them. I think that is part of a data enclave-type situation, where you have trusted researchers that have been vetted. I think we can tolerate different re-identification risks.
I think it really, as Ira has proposed, requires a melding of a lot of layers and multi-dimensional solutions here. I am an HIV epidemiologist. When did we start making success with HIV? When we threw a whole lot of drugs at it at the same time, so that if one thing didn’t stop it, another thing did.
That is one of the things we need to do with privacy as a whole is view it as a systems problem and apply systems-type approaches to, if one thing doesn’t stop it, perhaps the next thing will, perhaps the next thing will. We need multi-dimensional solutions. De-identification is important because it disincentivizes people from attempting re-identification in the first place. We should never get rid of it. It also shouldn’t be the only thing that we are relying on.
MS. MILAM: I have got two questions. The first one is really along the same line as Bob’s. We have heard today with Safe Harbor that the re-identification risk could be .01 percent or up to .04 percent, but both small risks. I am wondering if there have been any studies, or if you all have done any work around application of these SDL techniques on top of a Safe Harbor dataset. For example, generalizing instead of ages, age ranges, that type of thing.
I heard earlier K-anonymity is probably not a good technique to use. Minimal cell sizes on top of Safe Harbor for data out on the web, not data where you would utilize a data use agreement, but for data that is freely available. I would be interested in your reactions.
DR. GARFINKEL: I would like to correct two presuppositions in your question, though. The rates that you quote for Safe Harbor re-identification, that is a specific study looking at a specific set of identifiers used for re-identification. Other data in a de-identified Safe Harbor protected dataset could make it easily re-identifiable. That data could be released under the Safe Harbor guidelines.
For example, it could list favorite color. It could list child’s favorite color using Brad’s example. There are lots of information that could be used to change that re-identification rate. You said you heard that K-anonymity is not a good technique to use. I didn’t say that. I said that it is a framework that you can use for measuring the effectiveness of de-identification that has been used.
Different values of K produce different amounts of data quality degradation. It is a technique for making the measurement. There are algorithms that implement a K-anonymity result after you apply the algorithm to the dataset. They will result in different degradations to data quality.
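The use of K-anonymity as a measurement, and the trade-off against data quality, can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not a production de-identification tool and not the method of any study cited in the hearing: the records, field names, and helper functions are hypothetical, and generalization is limited to replacing exact ages with fixed-width age ranges, as in the question above.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Measure K: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalize_age(record, width):
    """Replace an exact age with a range of the given width (e.g. 30-34 for width 5)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records; age and sex are treated as the quasi-identifiers.
records = [
    {"age": 31, "sex": "F", "diagnosis": "asthma"},
    {"age": 34, "sex": "F", "diagnosis": "diabetes"},
    {"age": 36, "sex": "F", "diagnosis": "asthma"},
    {"age": 38, "sex": "F", "diagnosis": "fracture"},
]

print("exact ages:    K =", k_of(records, ["age", "sex"]))
for width in (5, 10):
    coarse = [generalize_age(r, width) for r in records]
    print(f"{width}-year ranges: K =", k_of(coarse, ["age", "sex"]))
```

Wider age ranges raise the measured K but leave less detail in the age field, which is the data quality degradation referred to in the testimony.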
DR. BARTH-JONES: I think we haven’t studied what the risks of Safe Harbor are at anywhere near a sufficient level. The studies that have been done tended to use a few quasi-identifiers. As in the #PrayForDad example, we know that information is being released with great detail on social media.
What I would propose is that we probably need to do some ethical study designs in which you have a group of people who go out. We take a random sample of individuals. We see how much can you figure out, have at it at the internet. We now take just the quasi identifiers from those individuals, strip off the direct identifiers and hand it over to another team, who only has surgery took place at this point in time. It was this type of surgery and those kind of details.
Then see how many individuals and datasets can be re-identified. I think studies where you divide that up, so a person’s privacy is not being violated by the study design, can ethically be conducted. We can be more rigorous in looking at Safe Harbor. But certainly with the evidence we have been provided so far, I think the idea of eliminating Safe Harbor, as some have proposed, really would have more harm on the other side of not having any mechanism that is easy that people can implement to be able to release that kind of data.
One of my pitches in what are the pressing issues and what can be done, I would like to see more HHS-sponsored research in which things like this try to be tackled, rather than our waiting for an independent researcher to go ahead and do something, re-identify a few highly-targeted individuals, go straight to press as has sometimes occurred, not even to peer review. The better papers have gone to peer review.
Do some sponsored research and make sure that this information gets disseminated to the public. If it turns out the re-identification risks are high, then let the chips fall where they may. We need to change policy. But let’s at least measure it properly.
MS. KLOSS: Have there been any assessments of the gap in skills and competencies in our health system to oversee or to manage the de-identification process? Do we know anything about the size of the people side, the gap in the people side, what is needed? How much of a training lift is this? I appreciate the comment that this is master’s level work. Given the scope of health care, we are going to need a variety of solutions. I just wondered if anybody has done any of that assessment.
DR. MALIN: I am not aware of an assessment. I could go anecdotally and tell you that number is exceedingly small. Then there is also this question of how big of a training program do you need to really push this forward. I don’t think that we know that yet. It would be a wonderful investigation, but I also think it would be a very challenging one.
One of the main reasons is that, I was alluding to this earlier, in that the health care concept, what health care is and where it lives, is extremely diverse. If we just start talking about all the data that gets generated in academic medical centers, there are over 100 academic medical centers in this country easily. There is not a statistical expert that is going to be embedded within each of those.
I think if you were to talk to a general biostatistician, they would tell you they don’t feel comfortable doing this type of assessment because they are not trained in it. Most informaticians are not trained in it. Most computer scientists are not trained in it.
If you then go beyond the academic medical centers, you now have every health data-managing company in this country. The number of start-ups that came out over the last three years alone would just easily squash that number 100. Then it is a question of how much support do you need for this.
The other thing I will say is de-identification is one tool for sharing data. It is not the only tool. Data is not always going to be shared in the de-identified format. There is going to be lots of organizations that figure out how to create agreements that allow for the flow of information, either in a semi-identified limited dataset way or in an identified way.
I think that we are going to see other models where it might be that de-identification is not necessarily appropriate. There are just lots of studies that you can’t do with de-identified data. Just to come back to your original question, I think we don’t know the gap.
DR. GARFINKEL: There is also a lot of de-identified data which is also pseudo-identified with pseudonyms, with coded data. Coded data, correct me if I am wrong, can be considered de-identified under HIPAA. That is an issue.
MS. SANCHES: I am sorry. You are asking about what kind of data? I can’t answer that question.
DR. BARTH-JONES: If we took people who were prepared with master’s degrees in biostatistics and epidemiology, and all the rest of the training that they get that is important to do this well in terms of understanding medical informatics and all the rest of that background, I do think it would be possible to get most of those people up to the level where they importantly contribute to the need for de-identification expertise by adding as few as two to four courses beyond what they already get.
I think it is not an insurmountable addition to try and pump the additional support into being able to do that. Like I said, I recommended centers of excellence. I think having combined education opportunities and actual practice of de-identification in the same universities is really a model that we should be building.
DR. SUAREZ: I hear a lot of very good highlighted points. I think I am just listing them in my own list here. I wanted to just make sure I understood a few of them well, particularly focused on this policy area.
The first and most interesting one to me is that the volume of cases, documented or otherwise known, where there has been a re-identification, by brute force or some other way, is very limited, very minimal. It seems to me it almost goes to say the other big message, which is that de-identification seems to work. There seem to be safe and reliable ways to de-identify data.
I think a couple of you mentioned how re-identification is expensive, is costly, is difficult. You need PhD degrees and all those things to do it. There is always the concept that the risk is never going to be zero. There is always going to be risk.
The other one is that the data is not going to be de-identified forever. In other words, it is a time issue around the concept of de-identification. The other thing that was mentioned was that the current methods in HIPAA, the Safe Harbor, the statistical analysis, are not necessarily sufficient. We need more. There is not one single way to de-identify that works. There are a lot of other elements. It is a multi-dimensional approach that needs to be applied.
That is where I want to try to focus on my question, which is I am trying to identify policy drivers that would help us improve the degree to which data is protected from re-identification. I think the way I see things, there are two areas for policy direction. One is on the end user, enforcing things or requiring things on the end user, end user meaning the recipient of the data that might be using it.
There is that idea of the law prohibiting the re-identification. There is enhancing or expanding the regulatory uses and controls that the user can give. There is stringent penalties and consequences, if you try to. That is a number of policy actions that you can take at the end user.
Then there is a number of possible actions you can take at the level of the organization releasing the data. One of them is supplementing the de-identification techniques with additional, as Ira mentioned, statistical disclosure limitations, including things like risk assessment and other elements. I wanted to explore and see if you had other concepts around these policy drivers, either at the end user of the data or at the organization that is disclosing the data. Any thoughts about it?
DR. MALIN: I don’t know all the answers to this yet. I mentioned Ellen Wright Clayton, who I have been working with on this topic. The NIH actually just sponsored a center of excellence in ethics research for us to actually look into some of the risk-based aspects of privacy, specifically with respect to genetics. That is going to be a project that goes on over the next four years.
The challenge for us is there are many different ways to control the system. You can put laws in place. You can put economic disincentives in place. You could try to put pressures in terms of controlling the system with respect to where risk lives. What Ira was alluding to, I didn’t really view this as different types of systems, just different ways of controlling risk in the system.
If you have an enclave, for instance, you are basically limiting the number of people who can come in. You are credentialing them. Credentialing can go a very far way in that respect. You are providing oversight with respect to what they are doing because you have direct audit because you are in your environment. So auditing, unique logins can also be used as deterrents.
The next level up is you are opening up the system a bit more. You may be reducing the fidelity of the information. You may be requiring that they go through some type of semi-credentialing process. I am just saying that what this means is that there are levels of risk.
The concern that I have is if you specify it as three, what happens when there is four? What happens when there is five? My suspicion is that over time, there will be innovations for managing access. I would be hesitant to say that we should just define what these three levels are and just be done with it. I think it would preclude flexibility moving forward.
I would say that you should consider these other notions. The one thing I would not want you to do is create a law that bans re-identification. I actually think that this could be misinterpreted. Unless it has appropriate carve-outs, you could end up stymying research in privacy.
You want to do this in a way that says it is use-driven. If there was re-identification with intent to harm, you might want to preclude something like that. But if it is re-identification just for curiosity purposes, you don’t necessarily want to penalize that in the same way.
I will get two-faced on you for a second. I worry about people who are academicians like myself that don’t necessarily have guidance on how they are supposed to disseminate their findings. There have been decades of research in ethical hacking, where you commit the attack. You show the exploit. You go back to the place that has the problem and you fix it. Give them the opportunity to fix it before going public.
In an information security setting, that works really well. You find a zero day exploit. Or you find some type of a port open, and you fix it. It is not open anymore. With data that has been put out there, you can’t pull the data back. I don’t know exactly how to create a similar type of environment with respect to data privacy as we would have with information security in that regard. I will just say to look at it as every possible type of deterrent that you can set on the table. Then you can tune the knobs accordingly.
DR. BARTH-JONES: I want to add my voice quickly to what Brad just said about re-identification science. It is really important that we get a good carve-out that makes it clear that can be done without the oversight. We would be harming ourselves more in the long run if that was something that you ended up stopping by prohibiting re-identification.
MR. RUBINSTEIN: I would endorse that, as well, and make just two additional observations. One is that we are currently under a regime that focuses heavily on de-identification. We are now beginning to discuss whether to add other factors. That naturally raises the concerns, as someone just mentioned, about whether it is three additional factors or four additional factors, and how do we decide that.
I think part of what that implies is that there is going to be a need to create some mechanism for revisiting these issues on a more regular basis. I am not sure at this point what to recommend. The usual method of infrequent revisiting of formal rule-making just may not work here. It may not be sensitive enough or rapid enough to deal with innovations either in what type of data is being released or what type of re-identification methods are now available. I think that is something that you should contemplate that you need to create a mechanism for almost annual review that reconsiders any factors that have been identified, and may add or subtract from them on an ongoing basis, rather than a set of regulations that is just fixed in stone.
The other observation I wanted to make, maybe at the risk of kind of reopening this poor issue of re-identification or the severity of the risks associated with that, is that in my research, for what it is worth, I have been reviewing some of the literature that takes a different stance and characterizes these re-identification risks as much higher, as much as it has been criticized by Daniel and Brad and others.
I think it is important for the committee to recognize that if they heard from some of these computer scientists on the other side, they would hear equally broad and impassioned statements about the risk being quite different from how they are being represented today. Again, I am not in any position to judge or evaluate or say that one side is right and the other side is wrong.
But there is this other literature out there that takes a much more skeptical view, is less impressed by these empirical studies, and sees the prospects of re-identification as much easier than is being represented here. I think you at least need to be aware of that.
To me, the lesson to draw from that is the need for cross-disciplinary collaboration to see if there is some way to explain to the policymaker why there are such vastly different views. Then to figure out what to do about that.
DR. MALIN: I will only supplement that by saying you will hear from two of them tomorrow.
MS. MILAM: I am interested in hearing about de-identification of survey data. When you look at survey data, what conditions, what do you need to evaluate about the data and about its collection methods, so that you can ensure it doesn’t approximate record-level data?
DR. BARTH-JONES: One of the key issues in survey data is what is the sampling plan that is behind it, and can someone reasonably know whether the individual who is in the sample data that is being reported, indeed was in the sample. If someone already knows that you are in the study, and it is a unique observation, then that individual is going to be more easily re-identified than if they had to be re-identified from their quasi identifiers in the broader population.
It is possible that one could oversample in order to avoid this, and then not report on the entirety of the sample. But actually take a sub-sample of the sample in order to introduce that confusion. That is something that is almost never done. It involves expense. It does provide a potential solution in that kind of situation.
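The two ideas described above, uniqueness on quasi-identifiers and releasing only a sub-sample of the collected sample, can be illustrated with a minimal sketch. This is an illustration only, not a method presented at the hearing; the function names, the field names (`zip3`, `age`) and the release fraction are hypothetical assumptions.

```python
import random
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once.

    A rough indicator only: a record that is unique in the released file is
    easier to single out if someone already knows the person took part.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def release_subsample(sample, fraction, seed=None):
    """Release a simple random subset of the collected sample.

    With a release fraction f, knowing that someone was surveyed no longer
    implies that their record is in the released file.
    """
    rng = random.Random(seed)
    k = round(len(sample) * fraction)
    return rng.sample(sample, k)


if __name__ == "__main__":
    # Hypothetical toy records, invented for illustration.
    collected = [
        {"zip3": "372", "age": 34},
        {"zip3": "372", "age": 34},
        {"zip3": "372", "age": 51},
        {"zip3": "100", "age": 51},
    ]
    released = release_subsample(collected, fraction=0.5, seed=1)
    print(unique_fraction(collected, ["zip3", "age"]))
    print(len(released))
```

Under simple random sub-sampling, each respondent's chance of appearing in the released file equals the release fraction. That is the "confusion" the sub-sample introduces, at the cost of collecting more data than is reported.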
If we assume that indeed the person is known to be present in it, for example, if you reveal to your neighbors that you are in the American Community Survey, there is probably a lot more risk of you being re-identified in the public use microdata files from that. There are solutions in those situations. Some of them are statistical. Some of them are on the end user in terms of controls that would be imposed on people who get access to the data.
DR. MALIN: It really depends on the type of survey that is being run. The context is key. The semantics of what is communicated within the survey is also key. If you can create a formal model for what exactly is it in terms of how it was collected and what is in there, you can run a risk assessment on it, as well. You could possibly even translate it into a Safe Harbor environment.
I think we have seen similar things, for instance, with clinical trials because trials have a similar type of situation in which you have a specific, very clearly defined recruitment program. There are a lot of surveys that are often run within the context of the trial, just to see how individuals are progressing. Just asking them general questions about everything from what is their occupation to how are they performing in their occupation, how is their family viewing their progression, things like that.
There has been a lot of work done in de-identification of clinical trials over the last couple of years. There is even a lot of sharing of that data going on now. You could possibly relate this to it. I would say I am not an expert in survey data.
DR. MAYS: One of the things that we have been doing is talking a lot about researchers in terms of their access to data and what happens in terms of it getting re-identified. Researchers are a great case because we have people who, one, we can go to or, two, that we have an IRB.
One of the things that HHS is interested in is pushing data out the door and having like data entrepreneurs being more involved. Sometimes when we are talking about researchers, I have a sense we have this value that the researchers are going to do good. But as soon as we start talking about commercial use, we get a little more worried.
I would like to hear what your thoughts are about the issue of data warehouses, data entrepreneurs, Google, other sources that are really invested in putting data together. Should there be a different kind of use case if this data is commercial? Or is it kind of we have this value, and this value is going to be the same for all?
DR. MALIN: I should caveat by saying that I am sponsored both by Alphabet and IBM. I have grants in my lab that come from them. Particularly with respect to the Precision Medicine Initiative because we are running the pilot program with Verily under Alphabet.
I think that it is honestly a wonderful thing to push data outside of the traditional domain and put it into the entrepreneurial market-based environment. In many respects, at least I have worked in that domain without disclosing anything beyond an NDA.
I can say that for the most part, the companies that I have worked with really just want to do good with the data and are trying to build new types of ways of analyzing data and provide feedback across the system that they view to be fundamentally broken in terms of the way quality is evaluated, in terms of the way outcomes are assessed.
But I don’t think you can ever completely anticipate what is going to happen outside of an academic research environment. One thing that I do worry about is that if there are no limitations of what you can do with data, you could end up in a situation where it turns into data exploitation simply because it is a free for all at that point.
Does there need to be a different use case out there? You may want to go to a situation where you say like data that comes out of covered entities does have certain limitations of what it can be used for. But I do worry that if you go that route, you will stymy innovation again. Until you see it being used, I don’t think you are going to know what exactly it is you are looking for yet.
That again comes back to what Ira was pointing out, which is that it is probably something that you should be revisiting as the market unfolds. I think the market is evolving extremely rapidly. I know we say that about a lot of things. But I have been involved in de-identification in the use of health data for over 15 years.
I really feel like it has hit the point where it is moving away from the very beginning of the S of the innovation curve. You are just like in that exponential growth in terms of the collection and its use. I think if you revisit a year or two years from now, you are going to have a much clearer picture of what is going on out there.
DR. BARTH-JONES: I think it is also important to consider that we have a sectoral privacy law structure here in the United States, and I don’t think people anticipated the extent to which HIPAA, by doing a relatively good job of locking down health privacy within the system, created a driving force that is being capitalized on outside of HIPAA.
There is a lot of health information outside of HIPAA that is not regulated. When we do de-identification assessments and determinations, trying to figure out what information is reasonably available to the anticipated recipient is becoming a much, much more complicated issue. This committee isn’t going to single-handedly address the issue of multi-sectoral privacy law in the United States. But I do think it is important to realize that, unlike other environments where there is a broader regulation of privacy and data privacy, we face some unique challenges here because of that. That really impacts on this question.
MR. RUBINSTEIN: I would add that, on the question of making this data available and the innovation benefits that might result, there is also a difference between making it available in such a way that new methods of analysis might be applied, or research in the same area of public health is carried out but by different entities, versus incorporating released data into just commercial services for the benefit of supporting more granular targeted advertisements or other direct business uses along those lines. It is important to keep those points separate. It might be that there is greater support and tolerance for the former, but that doesn’t necessarily have to extend to commercial uses that are just directly benefiting the bottom line of the companies who gain access to the data.
This raises a host of other issues, but maybe it is still worth putting out there. There is also just the question of whether this is a one-way street or a two-way street. Clearly data scientists are interested in gaining access to the data bank that these large internet companies maintain and are very reluctant to share. There might be some thought given to whether that could become reciprocal rather than just a one-way street.
DR. MALIN: So, just to recognize, we shouldn’t call these all health data companies because they are just data companies. A lot of them are outside the scope of HIPAA. Some of the big genomic firms, for instance, like a lot of the sequencing facilities are generating a lot of genomic data that is not under HIPAA at all.
We are starting to see the potential for a segmented or differentially regulated environment for the same exact data. I don’t know how to solve that one. If you can solve it, that would be wonderful for me. It is coming around pretty quickly. That doesn’t include 23andMe. It is talking about sequencing.
MS. BERNSTEIN: I am wondering if you imagine you are a new federal agency that collects lots of information about consumers. You were putting together a new program for how to make that data available to the public, whether it is researchers, consumers, the press or whatever. What kind of program would you start to put together now, starting today? Or what elements of such a program do you think are necessary to have a robust program in place for an agency like that? Just asking you kind of the policy questions.
DR. MALIN: Can you provide a little bit more scoping here? You just alluded to like a federal agency that is just going to collect a lot of data.
MS. BERNSTEIN: A lot of federal agencies collect a lot of data. We have tons of data at CMS, whatever. There are other agencies out there who collect a variety of kinds of information about consumers, whether it is human services kind of data from beneficiaries or financial data from the variety of financial regulators and so forth.
An agency like that that is collecting data about consumers, I am just thinking about it. I am sort of asking, well, if we were starting this business today, rather than we had started it in the ‘90s, would you do something differently. Is there something that you would put in place now that you didn’t? Would you think about it differently because we are here today than we thought about it? Would you start differently? Does that make sense? Not really?
DR. BARTH-JONES: If you are in the design phase, and you are trying to figure this out, bring in-house de-identification and re-identification experts. Make sure you get at least one formalist and one pragmatist. Listen closely as they fight it out. Seriously, there are valid perspectives on both sides of this equation. I think it helps to have more in-house expertise in these kinds of issues, where it is not just somebody that you are talking to for an hour or a couple-day consultation, but somebody who you share space with and who can help from both sides of that perspective. I would say that would be an early hire in my book if you are building something like that, at least two on both sides.
MR. RUBINSTEIN: I think this is a very relevant, but difficult and complex question. I have started to do a little bit of inquiry into maybe what you are alluding to in the background here, which is the open data movement. Many of the new federal initiatives are requiring agencies to have plans in place for making data more widely available.
I haven’t studied it sufficiently to offer any real analysis or recommendations. But I have noted that when I look at those policies, there is usually a one-line throwaway to the effect of “consistent with appropriate data protection requirements.” That is the end of it. I don’t think that anybody has really fully figured out how to do this properly, how to both serve the goals of openness in a data environment for purposes of research and innovation and so on, but to do so in a way that respects privacy. I think that it is a big looming topic that requires a lot more.
DR. MALIN: Thinking about this a little bit more, I would say over the next year, there is going to be a grand experiment run with the Precision Medicine Initiative. That experiment is starting with the pilot that we are running.
We are in the midst of working out how to do ingressing of data in a privacy-preserving manner, trying to figure out how to set up management facilities for holding, for the ingressing as well as the management after we have ingressed it. The amount of data that we are generating is quite substantial. We are talking about petabytes to potentially exabytes of data, which was one of the reasons why we partnered with Verily on it.
We are also in the midst of setting up two-tiered access strategies both for credentialed investigators to get individual-level records, as well as create public front-facing query infrastructure for just getting some general intuition into what is going on with that data. But we haven’t done it yet.
MS. BERNSTEIN: This is very interesting actually. I mean, I couch this in the idea of a federal agency doing this. But really any entity that has a lot of data about consumers, it doesn’t really matter whether it is a federal agency looking at its own data or some other commercial entity that has consumer data that wants to be able to share in a commercial way, like you were talking about, for the betterment of society or whatever. Yet, wants to do that in a privacy-protected way, not just the stuff that you are talking about with the technical de-ID. What other processes or functions or policies might they have in place? I think one of the things that is interesting is the idea of a data intake group of some kind.
DR. MALIN: There are multiple committees that you need in that type of a program. In addition to setting up a data access committee that PMI is going to use, we are also using just scientific study design committees that are going to provide intuition into the types of data that should be collected and could be collected. And how reliable that data is going to be, because we are going to try to mitigate as much noise as possible, which is dirty data in, dirty data out.
At the same time, I will reiterate something I said to the committee almost three years ago, which is that one of the things that we keep finding is the most important is the public trust. If you continue to create de-identification opportunities in a way that completely discredits or just holds the constituents to whom the data corresponds off to the wayside, then the day that you end up with a major breach or a major misuse of the information that violated a contract, it has the potential to be detrimental to the future of research for potentially five to ten years.
One of the things that we did at Vanderbilt when we set up our de-identified research environment was we created a community advisory board. If there is going to be a federal agency, for instance, that is going to be collecting lots of data on the general public, I would want to know that the public had some type of a say in terms of how that information was coming in and how it was being used. I think that would be important.
MS. BERNSTEIN: Do you think commercial entities do something similar to that? Do they establish relationships?
DR. MALIN: I think the smart ones do. I think you should. I think it is a fair question.
MS. KLOSS: We are at the lunch hour. I really want to thank you all for a fabulous morning. We have learned so much. Thank you sincerely on behalf of our committee. I hope some of you will have a chance to stay on and listen to the afternoon or tomorrow.
Walter has a brief announcement. I just have a question for members of the committee in terms of whether there is any interest in us trying to plan dinner together, or if we should just go back to our rooms and read some more. Any interest in dinner?
MS. BERNSTEIN: Those of you who are interested in dinner, talk to me or Linda or Rachel. We will pull something together for the number of people who can join.
DR. SUAREZ: This is very appropriate, I think, to take this opportunity to mention a few words about one of our members of the National Committee who will be leaving us after eight years of participating in the National Committee. It is with a lot of gratefulness and a lot of sadness as well that I say that for Sallie Milam, who is probably our longest-running member of the National Committee at this moment, this will be her last official meeting with the National Committee. It was unfortunate that we couldn’t have a full committee meeting before she departed. I wanted to take a minute to, on behalf of the National Committee, thank Sallie for eight years of incredible participation, engagement, leadership, always providing just an amazing level of thoughtfulness in her comments, in her input to all the work that we have done.
I think she is always thinking about the individual, the person and the consumer, and I think that is testimony to her personal beliefs. I don’t know how many of you know, but Sallie was appointed in West Virginia as one of the first chief privacy officers in state government and in health care. She was initially appointed under the West Virginia Health Care Authority, but then became the statewide chief privacy officer of the state of West Virginia. I think she is a recognized leader in state government privacy. We have been very fortunate to have Sallie be part of it. At last count, I actually did a little bit of research behind it. I think I counted almost 35 letters in which Sallie participated and was a very active person. And I think almost 10 different reports of all sorts that had Sallie’s fingerprints on them.
Sallie, again, it has been an honor to serve with you. Really on behalf of the National Committee, we thank you for your service and your commitment to the National Committee, to the topic of privacy and ultimately to the citizens of the United States. I think your sense of responsibility and true commitment on behalf of the individuals is unique. Thank you again.
MS. KLOSS: I would like to ask everybody if they can to be back at 1:40. That is five minutes before we are to start. But we are going to start promptly at 1:45.
MS. KLOSS: Good afternoon. Welcome back to our hearing on De-identification and HIPAA. As required by our federal advisory procedures we need to just again announce the members of the committee who are here back after lunch.
I am Linda Kloss, member of the Full Committee, co-chair of the Privacy, Confidentiality and Security Subcommittee and also a member of the Standards Subcommittee. I have no conflicts.
DR. SUAREZ: Good afternoon. I am Walter Suarez with Kaiser Permanente. Chair of the National Committee and member of the subcommittees and Work Group and no conflicts.
MR. COUSSOULE: I am Nick Coussoule, member of the Full Committee, as well as the Privacy, Confidentiality, and Security Subcommittee and the Standards Subcommittee. I have no conflicts.
DR. MAYS: Vickie Mays, University of California Los Angeles. I am a member of the Full Committee, Pop and this committee, and I chair the Workgroup on Data Access and Use and I have no conflicts.
MS. EVANS: I am Barbara Evans of University of Houston. I am on the Full Committee and the Subcommittee and have no conflicts.
DR. PHILLIPS: I am Bob Phillips. I am Vice President for Research and Policy at the American Board of Family Medicine. I am on the Full Committee, Subcommittee on Privacy, Confidentiality and Security and on Population Health.
MS. MILAM: Hi, Sallie Milam, West Virginia Health Care Authority. Member of the Full Committee and the Subcommittee. No conflicts.
DR. RIPPEN: I am Helga Rippen, Health Sciences South Carolina, Clemson, University of South Carolina. I am on this Subcommittee and the Full Committee. I am on the Population Health and also Data. I have no conflicts.
MS. KLOSS: Are there any members of the National Committee on the phone? (No response)
MS. KLOSS: Staff to the Committee.
MS. HINES: Good afternoon, Rebecca Hines, I am with CDC/National Center for Health Statistics. I am the Executive Secretary.
MS. BERNSTEIN: I am Maya Bernstein. I work in the Assistant Secretary for Planning and Evaluation. I normally lead staff to the Subcommittee but at the moment Rachel is our leader.
MS. SEEGER: Rachel Seeger, lead staff to the Subcommittee with the Assistant Secretary for Planning and Evaluation.
MS. KLOSS: Just a process question, Rachel. Do we need to do introductions throughout the room again?
MS. SEEGER: It is not officially required.
MS. KLOSS: Then in the interest of time we will not do that.
MS. KLOSS: So, Panel II, we are addressing this afternoon de-identification challenges and we have a fabulous panel of experts to take us through these challenges and help us get a big picture as part of our overall agenda of understanding the issues, challenges, requirements, in light of changing practices. So with that I’ll ask Michelle De Mooy from the Center for Democracy and Technology to kick us off.
MS. DE MOOY: Thank you. For those who don’t know, CDT is a non-partisan, non-profit technology policy advocacy organization dedicated to protecting civil liberties and human rights on the internet, including privacy, free speech and access to information. I currently serve as the Deputy Director of CDT’s Privacy and Data Project, which focuses on developing privacy safeguards for consumers through a combination of legal, technical, and self-regulatory measures. Ensuring that services are designed in ways that preserve privacy, establishing protections that apply across the lifecycle of the consumer’s data, and giving consumers control over how their data is used are key elements of protecting privacy in the digital age.
We welcome the subcommittee’s work on the challenges and opportunities in de-identification in today’s rapidly evolving global marketplace. As for definitions of big data, I’ll spare you the volume, velocity, and variety definitions; I’m sure you may have heard them already. But the definition of big data in healthcare refers to electronic health datasets so large and complex that they’re difficult to manage with traditional hardware and software.
Big data in healthcare is overwhelming, not only because of its volume but also because of the diversity of data types and the speed at which it must be managed. The totality of data related to patient healthcare and wellbeing makes up big data in the healthcare industry, which can include anything, as you know, from doctors’ notes via clinical decision support systems to patient data in electronic patient records and social media posts.
The big data phenomenon encompasses not only the proliferation of always on sensing devices that collect ever-larger volumes of data, but also the rapid improvements in processing capabilities that make it possible to easily share and aggregate data from disparate sources, and most importantly to analyze it and draw knowledge from it.
Until recently the healthcare sector lagged in its use of information technology, however that’s rapidly changing due to a variety of factors, including the shift to electronic health records and the emergence of new ways to collect data. The healthcare industry like other sectors faces exciting new opportunities as a result of the cluster of developments occurring under the umbrella of big data.
Already building on a tradition of research, healthcare providers, insurers, pharmaceutical companies, academics, and also many nontraditional entrants, which I’ll speak a little more about later, are applying advanced analytics to large and disparate datasets to gain valuable insight on treatment, safety, public health and efficiency. In particular, and I assume you may have heard about this a little bit earlier also, hopes are high for big data as a component of the learning healthcare system, which aims to leverage health information to improve and disseminate knowledge about effective prevention and treatment strategies to enhance the quality and efficacy of healthcare.
As big data processing advances so too do privacy questions, though in some ways the privacy issues surrounding research and other health uses of big data are not new. The limits of notice and consent have long been recognized.
Issues of security plague even small data, but big health data is more than just a new term. The health data environment today is vastly different than it was even ten years ago, and will likely change more rapidly in the near future. Traditional frameworks for addressing privacy concerns like the Fair Information Practice Principles were interpreted as requiring entities to specify upfront the purposes for which data was being collected and to limit future use to those specified purposes unless new consent was obtained.
However the analytic capabilities of big data are often aimed precisely at unanticipated secondary uses. Existing practices and policies are unlikely to be sufficient then as the healthcare system, especially with its growing emphasis on learning and big data, involves increasingly complex data flows. Healthcare providers and payers with complex relationships with dozens or hundreds of vendors, partners, and affiliates face truly daunting risks.
Genomic data in particular challenges our notions of effective privacy and security practices like de-identification. Genomic data has changed the calculus from the probability of re-identification, which will certainly grow as more people participate in things like the Precision Medicine Initiative, to the possibility of re-identification or linkage. Advances in data science and information technology are eroding assumptions about the anonymity of DNA specimens and genetic data.
Databases of identified DNA sequences are becoming a de facto part of law enforcement, while government and commercial direct-to-consumer genetic testing is exploding in the marketplace. That growth is increasing the likelihood that anyone with access to these non-anonymous reference databases, from a genetic genealogist to a hacker, could use them to re-identify the person who provided a quote-unquote de-identified gene sequence. This is a very real instance, and this is something that’s happening today.
New risks of identification will undoubtedly surface as individuals are profiled using information that’s encoded in their genomes, for example characteristics like ethnicity or eye color, as scientists learn how to use these to profile people. Researchers have already discovered ways, for example, to predict somebody’s face from their genome. So somebody submits a genomic sequence, and they can do a 3D model of what their face might look like. And this has already been discussed with law enforcement, along with ways to use this in criminal justice.
It’s really only a matter of time before genomes become a routine part of our medical care and a part of our medical records, and our existing framework for the use of this data is woefully outdated. Patients today generally don’t know when their medical records have been disclosed for research or to whom, and that makes it difficult to object. In the not-so-distant future when medical records include our unique genomes the status quo will be ethically unacceptable.
To date regulators have interpreted federal health privacy law to permit providers to treat whole genome sequences as de-identified information subject to no ethical oversight or security precautions, even when genomes are combined with health histories and demographic data. Either this interpretation or the law should be changed. Many EMRs have the potential to record a patient’s consent about participation in research, but this feature is generally not used.
These considerations should be created in a way that does not impede the potential for big data, which in my opinion largely resides in data sharing partnerships. These partnerships could have a huge and positive impact on the healthcare system. Partnerships between providers and patients, between cities and universities and amongst commercial entities have already sprouted across the country.
Government agencies must be a part of this conversation now, creating a flexible framework for sharing that includes sharing rights for individuals, clear rules on data access, revised ethical standards for research, and renewed privacy and security guidelines that address new risks of identification and linkage.
In a healthcare system where data flows across sectors and systems the regulatory and enforcement efforts must not be siloed. Government agencies, including HHS, the FDA, and the FTC, must work together in non-traditional ways to effectively learn from and contribute to this modern healthcare system.
A good example of where the government is sort of impeding some of the progress, but has the opportunity not to, is in citizen science. I was just at a conference where I learned a lot about citizen science. The efforts of individuals conducting N-of-1 experiments to unearth innovative ways to improve symptoms or treat chronic disease have advanced medicine and healthcare in ways previously unimaginable.
The availability of fine-grained data about millions of disease sufferers provides a baseline and a springboard for these types of investigations. But current research and privacy regulations were not designed to support the creation of the vast data resources that today’s science needs, and they need not be at odds with each other. The answer lies in giving individuals more ownership and control over their data.
As Barbara Evans notes in a new paper, these regulations preserve data holders, as she says hospitals, insurers, and other record storage entities, as the prime movers in assembling large-scale data resources for scientific use, and rely on mechanisms such as de-identification of data and waivers of individual consent that have become largely unworkable in the context of citizen science.
Today’s regulations create unnecessary barriers for individuals who seek to access their own research results as well as denying them real control over what will ultimately be done with their data. And I know this approach I’m taking is a little different, but I wanted us to think a lot about individuals, because I think that’s what gets lost sometimes in the discussions about de-identification.
Access to data is one barrier that has stalled the learning healthcare system also. Surveys suggest that 80 percent of Americans would like to see their data used to generate socially beneficial knowledge, but much of the data collected by the medical industry is held and controlled by these large entities. Individuals have little to no control other than to remove it from research, and of course even that can be waived by an institutional review board.
I want to tell you a story from the conference that I just attended, and Barbara was there also, to illustrate the sort of issue that I’m talking about. Dana Lewis and Scott Leibrand created something called the Open Artificial Pancreas System. Fascinating work. It’s an open-source project aimed at making safe and effective basic artificial pancreas system technology widely available to reduce the burden of type 1 diabetes. This has changed people’s lives fundamentally, and this was done by two people who are not computer scientists.
So I asked them about this issue, about barriers to access and regulatory problems that they run into, and this is what they said. Patients like ourselves attempting to contribute to research efforts are systematically discouraged from participating in formal research, and are generally forced to work around the system because we get rebuffed when trying to work with institutional researchers.
And institutional researchers do not often have models that are incentivized to work in new and different ways such as with patient researchers. In many cases even after patients have proactively made their data public researchers are unwilling to use it without IRB approval. Our traditional system, far from protecting them, is actively harming patients by making researchers unable to work with them.
There’s a balance here. The government should create new models for regulation, that’s the first, that reflect the diversity of data types and of data partnerships, one that supports privacy and security while encouraging innovative work like Dana and Scott’s. These models might vary in terms of consent and de-identification requirements, but should ultimately retain an individual’s ability to control how and when their health data is collected and used.
Now I’ll shift just briefly to talk a little bit about the question of the risks that big data poses to personal privacy, security, and also to particular groups. The application of big data analytics in general can reproduce existing patterns of discrimination, inherit the prejudice of prior decision makers, or simply reflect the widespread biases that persist in society.
In healthcare this can be catastrophic. There are a huge variety of sources from which health or health-related data can be derived, such as activity trackers or social media posts, et cetera, incentivizing companies to gather as much as possible in order to tailor their products and services.
Hidden bias in the collection and analysis of this data can lead to unintended consequences for individuals, including unequal access to healthcare services. Inaccurate inferences may also interfere with a person’s care or could be misunderstood outside of the context of clinical care. Targeted advertisements, for example, may lead to weblining of consumers into categories without regard for the reputational harms or costs that could result.
Google Flu Trends is an oft-cited example of the limitations of big data analytics in health. And there’s a reason, it’s a really good story, a sort of clip of how this can go wrong. It highlights the problem of explaining when correlations are meaningful. Patterns are easy to find with big data, but understanding which patterns matter is another matter entirely.
As the FTC’s report “Big Data: A Tool for Inclusion or Exclusion?” points out, whenever the source of information for a big data analysis is itself a product of big data, opportunities for reinforcing errors exist. CDT is working on this more intensely and should have some recommendations related to healthcare specifically by the end of the summer.
Now onto sort of more technical stuff about de-identification. So all of that said, and I sort of went back and forth in preparing my testimony, saying de-identification is still good but, de-identification is still good but, and here we are at de-identification is still an important safeguard despite these challenges. Under HIPAA data may be used and shared for secondary purposes if it has been meaningfully de-identified such that the information could not be traced back to a specific patient.
It has been widely argued that de-identified data can be combined with other data to re-identify individuals, and this has led to criticism of reliance on de-identification as a privacy protecting measure. Though genomic data poses unique risks, which I’ve discussed, we believe these concerns, which have mainly been raised with respect to data de-identified and used outside of the HIPAA framework, mostly point in the wrong direction.
While there will always be a risk that de-identified data could be associated with particular patients, good faith, reasonable de-identification schemes when coupled with enforceable limits against re-identification, we think represent a balanced approach to protecting personal privacy while still allowing commercial and scientific value to be extracted from datasets.
HIPAA compliant de-identification provides very important privacy protections. I won’t go into too much detail, I’m sure you’ve heard a lot of it talking about the two methods that HIPAA allows. Despite that there are imperfections in what HIPAA allows, most especially in the safe harbor method. Few entities however actually use the statistical method, the other method, which may provide more protection while yielding greater data utility.
Despite these imperfections it’s important to note that re-identification is difficult for even the most skilled mathematicians and is generally expensive and unsuccessful. Regulation seeks to strike a balance between the risk of re-identification and the accuracy of data, and this attempt to find balance is at the heart of why de-identification as it stands remains mostly sustainable and effective.
So I go into some key recommendations related to the question about what innovations in data science might offer additional protections. I won’t read these verbatim, I’ll spare you all of that, but I’ll just kind of hit on a couple of them. So the first innovation that we wanted to talk about is record linkage. Record linkage happens through matching records that exist in separate data sets but have a common set of data fields.
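[Editorial illustration, not part of the testimony.] The record linkage just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: both datasets, all names in them, and the choice of ZIP code, birth year, and sex as the common fields are hypothetical and chosen only to show the mechanism.

```python
# Hypothetical toy data; none of these records are real.
deidentified_claims = [
    {"zip": "37201", "birth_year": 1960, "sex": "F", "diagnosis": "type 1 diabetes"},
    {"zip": "37215", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
]
public_roster = [
    {"name": "Pat Example", "zip": "37201", "birth_year": 1960, "sex": "F"},
    {"name": "Lee Sample", "zip": "37215", "birth_year": 1985, "sex": "M"},
    {"name": "Sam Placeholder", "zip": "37215", "birth_year": 1985, "sex": "M"},
]

# Assumed common fields shared by the two datasets.
LINK_FIELDS = ("zip", "birth_year", "sex")


def link_records(left, right, fields=LINK_FIELDS):
    """Pair each record in `left` with every record in `right` that
    has identical values in all of the common fields."""
    index = {}
    for row in right:
        key = tuple(row[f] for f in fields)
        index.setdefault(key, []).append(row)
    return [(row, index.get(tuple(row[f] for f in fields), [])) for row in left]


def unique_matches(links):
    """Keep only the links with exactly one candidate; these are the
    records that the common fields single out."""
    return [(row, candidates[0]) for row, candidates in links if len(candidates) == 1]


if __name__ == "__main__":
    for claim, person in unique_matches(link_records(deidentified_claims, public_roster)):
        print(person["name"], "->", claim["diagnosis"])
```

In this toy data the first claim matches exactly one roster entry and is linked, while the second matches two entries and stays ambiguous, which is why coarser or fewer common fields lower linkage risk.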
So we think the test promulgated actually by the Federal Trade Commission for data not regulated by HIPAA provides a similar and more flexible framework. That test states the data is not reasonably linked to an individual or device if, one, the party takes reasonable measures to de-identify the data, two, commits to not re-identify the data, and three, prohibits downstream recipients from re-identifying the data. If HIPAA de-identification standards are found in practice to be too rigid we think a shift to the FTC’s objective standard may be appropriate.
We also talk about the limited data set. There have been concerns about the limits that the Privacy Rule and the Common Rule impose on health research, as I touched on. Currently under the HIPAA Privacy Rule research is treated as a secondary use. Unless data are de-identified under criteria spelled out in the rule, or unless data are reduced to what’s known as the limited data set, individuals must give affirmative written consent to research uses of their data. Administratively such authorizations are very hard to obtain, as most people know, as most providers at the very least know, but certainly researchers, and essentially impossible after the data has been collected. This is the reality.
Yet some of the most promising applications of health big data, as I mentioned before, are in the field of research. How do we handle this? Limited data sets require fewer identifier categories to be removed than is necessary to qualify as de-identified, offering one vehicle for researchers to use without obtaining prior authorization. The recipient of the limited data set must also sign a data use agreement that sets forth the purpose for which the data can be used and prohibits re-identification, going back to those standards.
Another innovation I want to talk about is the idea of re-conceptualizing research uses. Another part of the solution may lie here. The 2012 White House Report for example talks about respect for context as a way to describe the essence of the use limitation principle. Respect for context means that consumers have the right to expect that service providers will collect, use, and disclose personal data only in ways that are consistent with how the data was collected.
In the context of health big data providers can use the initial context of data collection as a guide in circumscribing future uses of that data while still allowing for innovative analytic practices, as long as those are related to the initial collection. And we go into much more detail about that but I will not.
We also think that research in effective anonymization should be supported. One promising line of research already incorporated into the Google Chrome web browser is RAPPOR, I don’t know if the subcommittee has heard this before. Randomized Aggregatable Privacy-Preserving Ordinal Response is what RAPPOR stands for. It’s a technique that randomly changes the values of data while it’s being collected, which has the attractive property of rendering the potentially sensitive raw data that’s being collected meaningless, it’s randomized.
So it allows perfect precision in aggregated data disconnected from an individual. De-identification cannot be presumed to eliminate all risk of re-identification of patient data, therefore it’s important to require assurance from recipients of de-identified data. So this is one method I think that the subcommittee should consider as something to recommend.
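[Editorial illustration, not part of the testimony.] The idea of randomizing values at collection time can be sketched with classic binary randomized response, which is the building block RAPPOR extends. This is a simplified sketch, not RAPPOR’s actual algorithm; the function names, the `p_truth` parameter, and the simulated population are assumptions made for illustration.

```python
import random


def randomize(true_bit, p_truth, rng):
    """Report the true answer with probability p_truth; otherwise
    report a fair coin flip that is unrelated to the truth."""
    if rng.random() < p_truth:
        return true_bit
    return rng.random() < 0.5


def estimate_rate(reports, p_truth):
    """Recover the population rate from the noisy reports.

    E[observed] = p_truth * rate + (1 - p_truth) * 0.5,
    so rate = (observed - (1 - p_truth) * 0.5) / p_truth.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    # Hypothetical population: 30 percent hold the sensitive attribute.
    truths = [i < 6000 for i in range(20000)]
    reports = [randomize(t, 0.5, rng) for t in truths]
    print("estimated rate:", round(estimate_rate(reports, 0.5), 3))
```

Any single report is deniable because it may be a coin flip, while the aggregate rate is recovered approximately, with an error that shrinks as the number of reports grows.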
I’ll mention generalization as another technique that you might look at. Generalization is performed with user-defined hierarchies, which are transformation rules that reduce the precision of attribute values in a stepwise manner. Basically each hierarchy consists of increasing levels, so it’s a technical way to de-identify while also maintaining the accuracy of the data. Distributed networks is another way, and PCORI is one example of that, maybe the subcommittee has heard a little bit about this. It’s the decentralization of data by linking distributed data networks in lieu of centralized collection of copies of data.
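[Editorial illustration, not part of the testimony.] The generalization hierarchies mentioned above can be sketched as functions that coarsen a value one level at a time. The specific levels for age and ZIP code, and the toy records, are assumptions made for illustration only.

```python
from collections import Counter


def generalize_age(age, level):
    """Hypothetical age hierarchy: exact age, 10-year band,
    under/over 50, then fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    if level == 2:
        return "<50" if age < 50 else "50+"
    return "*"


def generalize_zip(zip_code, level):
    """Hypothetical ZIP hierarchy: mask one more trailing digit per level."""
    level = min(level, len(zip_code))
    return zip_code[: len(zip_code) - level] + "*" * level


def smallest_group(records, age_level, zip_level):
    """Size of the smallest group of records that share the same
    generalized age and ZIP values (the k in k-anonymity)."""
    groups = Counter(
        (generalize_age(r["age"], age_level), generalize_zip(r["zip"], zip_level))
        for r in records
    )
    return min(groups.values())


if __name__ == "__main__":
    records = [
        {"age": 34, "zip": "20001"},
        {"age": 37, "zip": "20002"},
        {"age": 52, "zip": "20001"},
        {"age": 58, "zip": "20009"},
    ]
    print(smallest_group(records, 0, 0))  # every record is unique
    print(smallest_group(records, 1, 1))  # each record now shares its group
```

Every generalized value remains true of the person it describes, only less precise, which is the sense in which accuracy is maintained while records stop being unique.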
In the Mini-Sentinel distributed database for example, which facilitates the safety surveillance of drugs approved by the FDA, participating data sources put their data into a common data model and perform the analytics. The aggregate results are then rolled up to produce the results, sometimes referred to as bringing the questions to the data. This is an important protection.
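[Editorial illustration, not part of the testimony.] The pattern of bringing the questions to the data can be sketched as follows. This is not Mini-Sentinel’s actual software; the record layout standing in for a common data model, the site data, and the function names are all hypothetical.

```python
def local_count(site_records, drug, event):
    """Run at each participating site: answer the question against
    local data and return only aggregate counts, never the rows."""
    exposed = [r for r in site_records if r["drug"] == drug]
    return {
        "exposed": len(exposed),
        "events": sum(1 for r in exposed if r["event"] == event),
    }


def combine(site_results):
    """Run by the coordinator: roll up the per-site aggregates."""
    total = {"exposed": 0, "events": 0}
    for result in site_results:
        total["exposed"] += result["exposed"]
        total["events"] += result["events"]
    return total


if __name__ == "__main__":
    # Hypothetical records, already in the same (assumed) common data model.
    site_a = [
        {"drug": "X", "event": "rash"},
        {"drug": "X", "event": None},
        {"drug": "Y", "event": "rash"},
    ]
    site_b = [
        {"drug": "X", "event": "rash"},
        {"drug": "X", "event": "rash"},
    ]
    results = [local_count(site, "X", "rash") for site in (site_a, site_b)]
    print(combine(results))
```

Only the small dictionaries of counts leave each site, so the coordinator never holds a centralized copy of patient-level records.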
Finally, I want to bring your attention to security and accountability. Obviously the healthcare sector has lagged on security. This is probably not due to the lack of guidance. The OCR, the Office for Civil Rights, has provided guidance, as has NIST. Instead it appears that some, maybe many, covered entities are not devoting the attention or resources to this issue. There have been many reports of breaches, even today, reports of ransomware, problems.
The OCR we think has been a leader in attempting to break out of its rut, sort of governmental rut, in attempting to meet some of the challenges of today’s healthcare big data system, stepping up enforcement and issuing updated guidance on things like media access to PHI, a very modern issue, and they’ve done a good job in trying to step up to those challenges.
Also I’d like to mention that state attorneys general were granted the authority to enforce HIPAA under HITECH, but many do not, and that might be an issue to consider, how we can sort of use that tool that’s not being used now.
I also want to say one final thing. One of the most important requirements imposed by the HIPAA security rule is that covered entities must conduct an assessment of potential risks and vulnerabilities to the confidentiality, integrity, and availability of the electronic health records they hold.
In the absence of specific security standards we think this risk assessment tool can be an important support. Risk assessment should be ongoing, but it’s an especially good time to conduct or refresh a risk assessment when providers or payers are planning the introduction of big data analytic techniques.
So issuing some kind of recommendation that says if you are going to apply analytics to some of your datasets, this would be a good time to reassess the security of those datasets. All measures must be commensurate with and calibrated to privacy risk, and specifically scaled to identifiability. As a result of the HITECH Act HIPAA includes a breach notification rule, which I just want to also reiterate as an important protection that is perhaps not used often.
As the size and diversity of these datasets grow and as analytic techniques make it easier to re-identify data and draw inferences from seemingly innocuous data, internal and external accountability must play a larger role in protecting clinical health data in particular. Regulators should continue to expand their enforcement efforts aimed at obtaining sanctions whenever data is illegally used or transferred or reasonable security protections are not implemented. Thank you very much for listening, and I’m happy to take questions whenever we’re ready.
MS. KLOSS: We’ll move right on. I’d like to introduce Jules Polonetsky to present his testimony.
MR. POLONETSKY: Thanks for the chance to speak with you today. The Future of Privacy Forum is a think-tank or perhaps more of an incubator of hopefully responsible privacy ideas. Our stakeholders range from the Chief Privacy Officers of about 130 companies to academics to advocacy groups to foundations.
We tried to be the centrist voice in the world of privacy, we’re optimists, we really do think data and technology are changing the world and are a force for good, but we get the challenges and the concerns and the hard work that needs to be done to minimize the risks and avoid the horror stories and the gloomy things that are part of the reality as well.
So companies sometimes think we’re way too progressive, looking for new rules, looking to worry and address concerns that some think are only theoretical, and advocates sometimes say big data is a whole bunch of bunk, it’s all about behavioral advertising, clearly this is a bad thing, why are you trying to support or validate that? And of course there’s truth to all of that.
We find that de-identification is the number one issue that the folks we deal with struggle with, and it’s kind of odd because here we are talking about privacy, many of my members are Chief Privacy Officers of global companies, and if we’re dealing with privacy law around the world, we’re dealing with privacy standards, and if we don’t actually understand what we’ve agreed to protect or not because it’s either personal information and it’s regulated or it’s not and it’s not, then what value is all the other work we’re doing, if we haven’t actually agreed on maybe the most central core of definitions, what exactly comes within the bounds of what you’re protecting.
And I’ve stated that because I think it’s actually the wrong way to have been thinking about this, because when we live in a world where either you’re subject to the rules, the obligations, and the restrictions because we’ve defined it as in, or you’re subject to no rules because you’re out, so have a good time, go ahead and do it, there’s no problem, you end up with the situation we have today where we have critics or regulators around the world who don’t want to miss protecting the things they see as possible concerns, and so they need to define their interpretation of their law which covers personal information as broadly as possible, because otherwise they’ve got no business talking to you about what is going on.
At the same time we have industry actors who like to do what they’re doing, and if they can claim it’s anonymous or de-identified or pseudonymous or non-personal, all terms that I’m not quite sure exactly what they mean, Simson and his report at NIST probably have done the best sort of clear let’s just state for you without having a big policy argument about the words, how they’re generally used and what they at least theoretically mean.
But frankly that’s almost a unique document because if you take a look and do a survey in almost any sector, we were doing some work around connected cars, auto-makers have agreed to a set of principles, legislators are trying to figure out what the rules should be for autonomous vehicles for all the new data that’s being collected by cars, and of course what’s personal information in this world?
And so we started looking at that and we took a look at the different language that’s being used, and it’s all over the place. Department of Motor Vehicles says this, and NHTSA says this and some state law says this, and I’m not sure what they all mean, they all clearly mean something that they want to protect, but it’s all over the place.
On one hand we’ve got those who are skeptical of big data who point to celebrated publicized re-identification attacks to show that it’s not possible in this world of big data to protect the data, there will be some unknown dataset that somehow is found that will pry open or some immeasurable amount of number crunching and deep computing that will somehow identify one, some, many, identify with some degree of certainty. It doesn’t matter, we’re re-identified, and so we have an alarm. When pressed, as I’ve pressed sometimes some of those critics, they’ve conceded well we just didn’t like behavioral advertising so we wanted to show that was a big scam.
But of course if the data is being used in useful ways by responsible actors we didn’t really mean that there was a database of ruin being created, only when the people are using the data in the way we don’t like, without the consent that we would have liked, so that’s why we took that particular view.
And you probably heard earlier in the day I imagine seeing some of the experts that you had some of the criticism of, the attacks on pseudonymous data sets that have been portrayed as breaking the efficacy of de-identification, and those are great to learn from and great novelty and great understanding of what is feasible, but we probably haven’t done enough to debate well so what, is that level of risk really what we’re trying to prevent against, and well if that dataset wasn’t made public and was subject to controls would we really care or not. We’ve reacted in general to the headlines and not the oh, I guess the answer isn’t don’t proceed down the line of de-identification.
But at the same time I think on the industry side we’ve seen almost obscene terminology used for the most robust collection of data. We see anonymous used for global unique uncontrollable identifiers that are available in marketplaces where anybody can buy data linked to those identifiers and add other information to it, and where companies say well perhaps it’s not personal to me, the fact that everyone else in the world has identifiers behind this, I don’t care, I’m only facilitating that with my particular device or cookie based identifier.
So lots of blame I’d argue, again on every side you can see why both sides sort of complain that we’re being a little bit too moderate and practical and sort of trying to make sense. But to some degree some of the backlash in the world of critics says I don’t trust you industry, look at what you’re doing, you’ve got location, you’ve got a million websites I’ve visited, it’s on my device, I can’t clear it, and you’re calling that anonymous.
So go tell a state school board that well we really need this data, it’s going to help us understand education better, and the risk is very small. Wait, what risk? What risk for my kids? We’re not good at doing that. I’m hopeful that frankly health, where you live with the world of risk every time decisions are made, the medical and health community says this isn’t a perfect thing but it might actually do more good than harm, so we believe that this is a drug, this is a treatment or procedure that’s in the market. We’re really not good at it anywhere else.
I can talk until I’m blue in the face about how autonomous vehicles have a pretty good record and have caused X number of accidents, X number of million, and that every single day people are killed by others who are texting or who are drunk or who are sleepy or are tired. So what are you scared of exactly?
You should be scared right now about getting on the street, and you’re worried about the autonomous vehicle like the days when we worried about horses being replaced by fire breathing machines that were going to be very risky as opposed to like I’m scared to stand next to a horse today.
So we’re really bad, and here now we’ve got to bring some assessment of risk into our analysis, and it’s really hard to tell and to decide what risk. Risk of statistical possibility of an identification, or risk that actually anything harmful will happen? What value do we give to controls? Clearly there needs to be an understanding of that data that is subject to what we consider credible controls.
And yes if you’ve shown that you can hack a voting machine, and if it’s remote and if it’s an old machine and it took a lot of work, I don’t care, if you’ve shown it is possible to hack the voting machine or if you have exposed a potential security risk, we know that somebody eventually will exploit it and that there is a real concern.
So it’s really understandable why in some corners of the world, or if you have a model where no it’s no longer regulated if it’s in that category and it can be shared and it can be made public, of course that level of de-identification needs to be one that’s more significant. And so what value we afford to controls and what they need to actually look like and whether that can be calculated into how significant of an effort I need to do to minimize the risk of re-identification is one that I think we don’t adequately build into our process.
A lot of folks do work globally, and so they’ve had to deal with the European definitions of personal information, and the Europeans have been a little bit inconsistent and all over the place, as have we. Of course there you’re either regulated or you’re not, and so regulators that have wanted to ensure that the activities they want to cover are captured have often spread broadly their definition. But once you’re subject to that definition you’re subject to a whole bunch of obligations, many of which are completely impossible to comply with.
But it’s been obscure, you can pull it together, and we’ve pulled a little bit of it in our paper, but you can pull it together by informal opinions, rare enforcements where you’ve seen what a public resolution of a particular thing around a particular identifier is, and it’s been phenomenal for lawyers who have followed the nuances of what their particular national regulator allows or doesn’t allow in a case by case basis, but it’s been terrible for general law, and certainly for those on the outside who look at the law and say well that’s the definition, and then I can’t comply so I must declare what I’m doing anonymous because if I had to comply I couldn’t not only do my business, I couldn’t comply with all the specific nuances around access and so forth that are obligatory.
HIPAA and FERPA in some ways have an interesting model, in that there at least is a carve out for certain limited we declare this personal but for research purposes and so forth in some specific matters we understand that we have an acceptable use given the minimal risk, yes it’s personal. But those are rare, and in some sense what we argue for, and I don’t know if you’re able to pull up our chart, although I guess you all have it here, is that actual recognition that there is a spectrum of personal, less personal, less personal, less personal.
Let’s leave the specific terminologies since they’re all used and mis-used to mean all sorts of different things by either lawmakers or technologists or practitioners. But how do we take the notion, which I think the technical de-identification experts appreciate that there is a spectrum of risk, from this is a high level number and is really meaningless, to this is a clearly explicit de-identification.
And if we’re able to agree that there are different stages of data, and we’ve tried to in our chart draw lines between data that is explicitly identified, I have your name, we can quibble about the particular attributes that are obvious, I’ve done something, but I get that it could be redone without too much effort.
But should there be some value, should there be some rules that are a little bit different when you’ve done something? Maybe, maybe not. How about you’ve actually taken some serious steps? But it’s possible, but it’s going to take six PhD students and a professor and access to data that is otherwise going to be protected.
Can we recognize that whatever we call it, maybe we actually want to have restrictions on the use? Sure that’s personal, or sure that’s not personal category, whatever we called it here, not readily identifiable. So maybe I want to allow that sort of use, but I want a prominent notice, I want a choice, I want a retention limitation, I want sensitive data to be treated differently as opposed to having to declare it and then treating that data like the most sensitive or the most risky. We tried to lump a couple of different flavors of pseudonymous, it ends up being sliced and diced in different ways.
Our suggestion is that we look at whether or not, and again this is built on some of the really good work that many who are in this room and others have done, is there a direct identifier in this dataset. We don’t define that often in our legal language, we say is it personal, does it relate to an individual, is it reasonably linked to an individual. But we need to actually sit back and say is there a direct identifier here, because if so that’s one clear set of categories that might be subject to these sorts of restrictions and these sorts of uses.
Are there indirect identifiers, and if so, given that those are clues of different variety, have we taken steps to limit that or protect that, or we’ll still call it personal but we’re going to now recognize that the utility of data is enhanced in some significant way that is going to benefit society. And then again what role or controls. If none, then why are we bothering? We clearly recognize that controls have value, so where can we apply controls to say this dataset with controls not made public with limited sensitivity is used or not used subject to these sorts of purposes.
And so if we’re able to establish a spectrum, and I get this is hard if we’re doing here is a legal code go out and follow it, but the reality is we’re dealing with a spectrum here, we’re dealing with shades of grey, and if we’re able to come up with several more stages with the kind of rules that are applicable, we’re seeing that happen a little bit in Europe, the new European general data protection regulation has different treatment for research.
And it’s not quite clear what is research, and whether that’s only going to be civil society or government research, companies do research as well, when is research research, another challenge, and we’re going to start veering for a second into the ethics conversations, many of our European colleagues believe that research is the job of government or civil society, and when you say well companies are doing research, oh but it’s going to be used for marketing. I think we recognize that there is an important crossover back and forth and that corporate data is often what academics are using, and just because data is public doesn’t mean it doesn’t have any privacy value.
It’s hard to have this de-identification conversation without dealing with some of the really important ethics decisions. So we talked about how it’s hard to assess risk, but at least we are able to assess risk. Privacy professionals do privacy impact assessments, there’s a robust history of how you go ahead and do that. Where we’re not good at doing or where we don’t have at least consensus and agreement, is how do you assess benefit?
Because once I’ve assessed risk, if the decision is risk here, risk here, do it or don’t do it, and we don’t get to the second question of well but I’m going to cure a disease, I’m going to save the world from exploding, I’m going to achieve this great thing, so isn’t that minimal risk or that medium risk moral and ethical for us to actually go ahead and proceed? Okay, well that’s hard now. Whose benefit? Again, my critic says well sure the behavioral advertising people, they think they’re benefiting the world, they think it’s all good, you get better ads so they think they’re good.
How can I trust you, the company, or even the researcher who is doing what they do and has a goal and a mission to do their work, we know that in the academy we might go to IRBs or we might have bioethics, but what about everybody else who is increasingly doing what is important research. Michelle did some incredible work recently with Fitbit and looked at their ethics process. I don’t know, they’re improving their product, there’s money, there’s business, but the product is supposed to make us more active, but then sometimes there’s actual research being done.
And so I think when you look at some of the takeaways of the recent OSTP White House look at big data in discrimination, at the end of the day you see a call for more work on ethics. What is ethics outside the IRB? Because if I’m doing a project and I need data at a certain granularity and there is going to be some risk of re-identification, where and when and who do I go to if I’m not sitting in an IRB academic environment or a common rule covered environment, where do I go?
And so right now I don’t talk to anybody. I go do it, and it’s business and it’s industry, and who knows, it may be ethically tainted, it may not. That may be research which if it’s not published and made available we will see leaving the public domain or never reproducible. Google will discover the next great thing and they won’t tell us about it because we’ll worry about whether they tested on humans. We need that sort of data being visible, being part of the public domain where we can critique and push back and say, it’s right or not right.
Those processes today are critical, and there is increasingly work being done across different sectors, both academics who do work that isn’t covered by the common rule and well you’re not covered by an IRB, it’s public data, no problem, and then are getting pushback when the public says well wait a second that doesn’t seem right, that’s certainly experimenting on human data. So what that it was public, so what that it was Twitter data, clearly we see the impact on the individual. So they want that sort of ethical structure.
And then again we’ve got companies and others who are doing similar work. So if we’re able to set up a model for a benefit, and when the FTC takes a look at its unfairness jurisdiction for instance, benefit matters if I’ve created some risk. But I can show well yes because that was in the process of delivering this benefit, that’s a factor in determining whether or not this was actually fair for you to do or not.
And many of the FTC security cases are in the area of security where does it really matter what you promised if you did something that put my data at risk, how is there any benefit to me there? And perhaps if I can show there is benefit then there’s less of an argument that there’s an unfairness case. So there’s room for this, the Europeans have even kicked off an effort to figure out where ethics fit into their new general data privacy regulation, and I think again you saw the White House Report calling for more research around that.
And there are challenges there, because if I just let somebody obscurely without any visibility at a company decide there was some risk of de-identification but what I do is really good, all is good, don’t bother, what kind of transparency should exist, what sort of independence should it be? Do you need different people from the company? Do I need an outsider there? Do I need to publish the standards? Maybe it will help me in terms of liability, maybe it will hurt me.
Completely open. We have an event actually on June 14th where Facebook and ATT and hopefully Michelle and others who are working on research standards around ethics are going to walk through, this is one of the areas, it’s not the only area obviously of decision making that might need to be made internally, what level of risk am I ready to —
So I’ll conclude with noting that it’s going to be a little more complicated, the FCC is in the middle of doing its rulemaking to cover the swath of actors that it now covers for privacy post their net neutrality reclassification, so the telecoms and the ISPs, or in FCC speak the BIAS providers.
So they are defining what is personal information under their effort and although that will cover just one part of the world it will intersect obviously with healthcare. We’ve got wireless, we’ve got transmission, we’ve got sort of data over those wires, so it will be relevant and it will be another sort of major agency sort of using terminology, so you won’t be surprised for us that we’re saying don’t invent something new please, build on what is there, recognize that there is a spectrum.
So I appreciate the chance to speak with you and hope folks who would like more take a look at the paper that we’ve just published called Shades of Grey charting a spectrum of de-identification and this visual which hopefully lays out a little bit of kind of a logical progression. Thank you.
MS. KLOSS: We have distributed some of your other articles but we’ll have to get that new one. Is it now published?
MR. POLONETSKY: It’s on SSRN, I think it’s being published by the Law Journal shortly, but it’s on SSRN, or our website.
MS. KLOSS: Thank you so much. Next we’ll hear from Ashley Predith, Dr. Predith, Executive Director of the President’s Council of Advisors on Science and Technology, PCAST.
DR. PREDITH: Thank you very much. As noted I’m Ashley Predith, I’ve been the Executive Director of PCAST for about 10 weeks so I was not involved with the development of this report, but I know that the PCAST members still feel very strongly about it two years later. The PCAST report was released in May of 2014, and the official record of their thoughts and the recommendations is that report.
On January 17, 2014, President Obama gave a speech in which he requested analysis of big data and privacy and their implications for policy. In that speech he asked for a scoping study focusing on the wider economy and society. He wanted the group to look in particular at how challenges inherent in big data are being confronted by both the public and private sectors. He wanted to know whether we can forge international norms on how to manage this data, and how we can continue to promote the free flow of information in ways that are consistent with both privacy and security.
Now in the request for a study on this two reports came out, one being what we call the White House or the Podesta Report, and the other one being the PCAST Report. The PCAST Report is entitled Big Data and Privacy: A Technological Perspective, because they look specifically at the technology piece of this.
In general there were three objectives to the report, the first being to assess current technologies for managing and analyzing big data and preserving privacy, the second to consider how big data and privacy will change with future technologies, and the third was to make a relative assessment of the technical feasibilities of different broad policy approaches. So how does the changing technology affect the design and enforcement of public policies for protecting privacy?
About nine of the 19 PCAST members were at a sub-working group of PCAST that did a lot of work on this in a very short amount of time. They were cochaired by Susan Graham from UC Berkeley who is a computer scientist, as well as Bill Press who is from University of Texas and who is also a computer scientist. They were aided by Marjory Blumenthal who is a previous PCAST Executive Director.
Now it’s already been noted what the changing technological contexts are here that bring about the question of big data, everything from going from small data to big data, the idea that data are collected continuously with or without our knowledge. Sometimes they are from purely digital sources like cookies on your computer, or sometimes they are measurements or byproducts of the physical world. There are a lot of implications for this, both around how the analysis of that data is done as well as how different streams of data come together in data fusion and data integration.
And also an important part of this technological context is the cloud as a dominant infrastructure in computing. It’s allowing for easy ingestion, access, and use of data, easy replication and distribution of data, and infrastructure for mobility of data as well, through things like smart phone apps and beyond. And it also brings about itself potential security benefits from automation, procedures and oversight.
In the PCAST report they mentioned a couple of examples involving healthcare specifically, this report was broader to look at all implications across society, not just specifically healthcare, in that they talk about issues of privacy when you come to look at say personalized medicine.
So for example when they say researchers will soon be able to draw on millions of health records, vast amounts of genomic information, extensive data on successful and unsuccessful clinical trials, hospital records, and more, they come to understand more about how some people with particular traits respond better to better treatments.
Now of course if you’re a doctor in that study you might want to know who those particular people are even though they were anonymous at the time the data was stored. And in that case one technological solution might be to create a protected query mechanism so that individuals can find out if they’re in that cohort who have those traits that might be responding to treatments.
Or perhaps to provide an alert mechanism based on the cohort characteristics so that when a medical professional sees a patient in the cohort a notice is generated. So that’s an example where maybe there is a technological solution that one could put into place when they’re trying to identify particular people who might benefit from something.
A second example they gave though was something to do more with a consumer device that has kind of a health application to it, and that is mobile devices to assist adults with cognitive decline and monitor them for dementias. So as it happens PCAST also released a report just two months ago called Independence, Technology, and Connection in Older Age, and they bring up the questions around this, the privacy implications of the monitoring technologies, also in that report.
And here they say that many people were in an aging population in the United States, the baby boomer population is reaching the final decades of their lives, and there is the concern that as the natural process of cognition goes on maybe there are apps or something like that, that can help a person with some memory issues, things like that, or perhaps can monitor them to see if they have an Alzheimer’s disease or something starting to form.
Now when you have an app like that there’s a significant collection of data going on from a person, in fact a person who may have cognitive decline. And so what are the conditions of the use of that app, what happens if the information is leaked, can something be inferred or can inferred information from those apps be sold? And these are the kinds of questions for which there are no answers right now.
Now despite all this PCAST firmly believes that the positive benefits of technology are or can be greater than any new harms to privacy as long as the policies are put in place for them. So they see incredible potential for big data, and its analysis, and that is the strong message coming across there.
What they ended up doing was analyzing four different technologies and strategies for privacy protection, and then they also looked at three robust technologies to look at for the future. So I’ll step through the four technologies they looked at and give a brief overview of them.
The first one is cryptography and encryption for cybersecurity. Cybersecurity involves technologies that enforce policies for computer use and communications, it allows systems to protect identity and to authenticate to say the user is who they say they are. And it’s considered kind of a part of the privacy question, however it can certainly be said that poor cybersecurity is a threat to privacy.
But violations of privacy are possible with no failure in computer security. So misuse of data or fusion of data from different sources can certainly reveal someone’s identity or tell something about them that they may want to remain private. So cybersecurity by itself is a technology and a strategy that’s necessary but not sufficient.
The second technology or strategy looked at was anonymization and de-identification. So on this one they said it’s becoming easy to defeat these techniques by using legitimately used techniques in big data analytics. They gave the example of the Sweeney, Abu, and Winn white paper from 2013 that described the process of using Personal Genome Project profiles containing zip codes, birth date, and gender, combined with public voter rolls and mining for names hidden in cached documents, and they were able to correctly identify 84 to 97 percent of profiles that had names.
Now PCAST felt that anonymization and de-identification sometimes give a false expectation of privacy where data lacking certain identifiers are deemed not to be personally identifiable information. They felt that anonymization remained somewhat useful as an added safeguard, but it is not robust against near-term future re-identification methods, and PCAST does not believe it is a useful basis for policy.
The third technology or strategy looked at was data deletion and ephemerality. On that they said that good business practices say that data that no longer has value should be deleted. What we know now is that big data frequently is able to find economic or social value that was previously unknown in that data. Big data practices can find economic research value.
And so there’s also increasingly large amounts of latent information in big data that may not immediately be known. So it’s not even clear in a dataset per se or in a data analysis whether all of that data on an individual is actually known, and so when you’re trying to delete something what pieces should you be deleting.
PCAST felt that given the distributed and redundant nature of data storage it’s not clear that data can actually be deleted with assurance. They felt that ephemeral data, the idea that it would kind of show up and then be deleted in some short amount of time, couldn’t be guaranteed. There’s no guarantee that people actually play by those rules. So their conclusion on data deletion and ephemerality is that from a policy making perspective the only viable assumption today and for the foreseeable future is that data once created are permanent.
The fourth technology and strategy that they looked at was notice and consent, and they said that this was most widely used for protecting consumer privacy, but that it places the burden of privacy protection on the individual, the opposite of what is meant by a right.
Notice and consent leaves an unlevel playing field in the implicit privacy negotiation between a provider and a user. The provider has a complex set of take it or leave it terms and legal firepower, while the user has a few seconds of mental effort and a long list to read. There’s a market failure there they felt. And it makes sense that maybe there needs to be a representative for that user in that market mechanism.
So those four technologies and strategies, from cryptography to anonymization, data deletion, and notice and consent, are what they talked about as present now. Then they also said there are three robust technologies going forward that are worth considering more strongly.
The first being profiles, privacy profiles, which is sort of a follow-up on notice and consent. They say the purpose of notice and consent is that the user assents to the collection and use of personal data for a stated purpose that is acceptable to that individual. And right now notice and consent is becoming unworkable.
PCAST believes that the responsibility for using personal data in accordance with the user preferences should rest with the provider, possibly assisted with a mutually accepted intermediary rather than with the user. They said to consider the idea of privacy profiles, whereby a third party, so for example maybe the American Civil Liberties Union, could create a voluntary profile, which in the case of the ACLU perhaps would give weight to individual rights, and that these third party organizations that create these profiles could then vet apps that are out there in the marketplace and then say whether or not those apps actually follow the privacy profiles that they’ve set up.
Now by vetting apps the third party organizations create a marketplace for the negotiation of community standards for privacy. PCAST felt that the United States Government could encourage the development of standard machine readable interfaces for the communication of privacy implications between users and assessors, and then of course there would be the need to automate this process as much as possible.
So the second technology or strategy they looked at was around the context and use question. And for that they said that for the application of privacy policies to the use of personal data for a particular purpose, i.e., its context, those policies need to be associated both with the data and with the code that operates on the data. They said the privacy policies of the output data must be computed from those of the inputs, from the policies of the code, and from the intended use, that is, the context of the outputs. And that information has to be tamperproof and it has to be sticky and stay with that data wherever that data is going.
The third technology they talked about as a robust technology going forward is about enforcement and deterrence using auditing technology. And here they said that it should be a straightforward process to associate data with metadata, there’s a lot of information that can be captured with metadata that can be audited, so where the data came from, detailing its access and use policies, authorization, logs of actual use, and extending such metadata to derived or shared data or secondary use data, together with privacy-aware logging, can facilitate that process. So there needs to be a technological way to do auditing in a regular way.
So I’ll talk a little bit about PCAST kind of perspectives and conclusions and then summarize their five main recommendations in their report. So they felt that big data analytics is still growing and that new sources of big data are abundant and that new data analytic tools will emerge. New data aggregation and processing can bring enormous economic and social benefits.
It’s not always possible to recognize privacy sensitive data when collected, and when merged with analytics it may be possible to even home in on the moment when the privacy of an individual is being infringed upon if you’re able to bring all of these pieces together. They said that the government’s role to prevent breaches of privacy that can harm individuals and groups will call for technology plus law and regulation to generate incentives and to contend with the measure and countermeasure cycle on this.
They said that attention to collection practices may reduce risk, but paying attention to the use of the data is the most technically feasible place to apply regulation. And then they also said that the technological feasibility of all of this really matters. Basing policy on the control of collection is unlikely to succeed except in very limited circumstances.
They do concede that circumstance may look something like if you have an explicitly private context, in that case sometimes the disclosure of health data can be that. Plus with a meaningful, and they emphasize meaningful, explicit or implicit notice and consent, maybe by privacy preference profiles, but done in a way that doesn’t even exist today.
From all of these conclusions they have five main recommendations, and of course PCAST as a federal advisory committee, only makes recommendations to the federal government as to what the federal government should do.
So first they said, as I stated earlier, recommendation one, the policy attention should focus more on the actual uses of big data and less on its collection and analysis. It’s not the data themselves that cause harm nor the program itself absent any data, but the confluence of the two.
Recommendation number two, policies and regulation should not embed particular technological solutions, but rather should be stated in terms of intended outcomes. So technology alone is not sufficient to protect privacy, and to avoid overly lagging the technology, policy concerning privacy protection should address the purpose, the what, rather than prescribe the mechanism, the how.
Regulating disclosure of health information by regulating the use of anonymization fails to capture the power of data fusion. Regulating control of the inappropriate disclosure of health information, no matter how the data are acquired, is more robust.
Recommendation three. With support from OSTP, the Networking and Information Technology Research and Development agencies should strengthen US research in privacy related technologies and the relevant areas of social science that inform the successful application of those technologies. Research funding is needed for technologies to help protect privacy, social mechanisms that influence privacy preserving behavior, and legal options that are robust to changes in technology and create appropriate balance among economic opportunity, national priorities, and privacy protection.
Recommendation four. OSTP together with the appropriate educational institutions and professional societies should encourage increasing education and training opportunities concerning privacy protection. That is, there need to be career paths for professionals who are concerned about privacy.
Recommendation five is that the United States should adopt policies that stimulate the use of practical privacy protecting technologies that exist today. It can exhibit global leadership both by its convening power and also by its procurement processes. They felt the government should nurture the commercial potential of privacy enhancing technologies through procurement. It should establish and promote the creation and adoption of standards. And they also said at this point PCAST is not aware of more effective innovation or strategies being developed abroad, so this is a place where the US could take leadership. Thank you.
MS. KLOSS: Thank you very much. And our last panelist is Cora Tung Han from the Federal Trade Commission, and we thank you for being with us.
MS. HAN: Thank you for having me. I am Cora Han, I’m with the Division of Privacy and Identity Protection within the Federal Trade Commission. The Federal Trade Commission is a civil law enforcement agency with a consumer protection and competition mandate. And as part of our consumer protection mission we focus on promoting data security and protecting privacy through a mixture of enforcement work, policy, and consumer and business education.
And so I’m going to focus my remarks today on our report in 2014 on data brokers but I’ll also spend a few minutes at the end also talking briefly about our big data report and what we’ve said about de-identification, since those may also be relevant for this group. But before I do any of that I should issue a disclaimer, and that is that the views I express today are only my own and they don’t represent the views of the Commission or any of the commissioners.
So let’s get started with the Data Broker Report. The data broker report, entitled A Call for Transparency and Accountability, was published in May 2014, and it arose out of a process the Commission kicked off at the end of 2012, when the Commission issued identical orders under section 6(b) of the FTC Act to nine data brokers: Acxiom, CoreLogic, Datalogix, eBureau, ID Analytics, Intelius, PeekYou, Rapleaf, and Recorded Future. The Commission selected these nine data brokers because they represent a broad swath of activity from a cross section of large, midsize, and small data brokers.
And the Commission also considered other factors, including the variety of products that these companies sold, the amount and types of data that they collect, and their use of different models to collect, share, and analyze this data. Under the order the Commission sought a variety of data, including the nature and sources of data the nine data brokers collect, how they use, maintain, and disseminate the data, and the extent to which they give consumers access to the information and the ability for consumers to correct and/or opt out.
The Commission reviewed the information, compiled it, and put out a report, and this report summarized the findings, proposed legislation, and recommended best practices. It’s a fairly lengthy report so I won’t be able to go through all of it, but I’ll try and hit some of the highlights.
As this graphic represents, data brokers or their sources are constantly gathering information from consumers. This occurs as consumers are using a mobile device, shopping for a home or car, subscribing to a magazine, making purchases from stores, using social media, subscribing to an online news site. Through a variety of ways data brokers and their sources are always collecting information.
And the report defines data brokers as companies whose primary business is collecting personal information about consumers from a variety of sources and aggregating, analyzing, and sharing this information or information derived from it for purposes such as marketing products, verifying an individual’s identity, or detecting fraud.
I’m going to talk a little bit about the sources, the product, and how these products are developed. The report recognizes that data brokers collect information from the three main sources listed in this slide. So government sources, and that includes the US Census Bureau, the Social Security Administration, the Postal Service, it also includes state and local governments including things like professional licenses, recreational licenses, motor vehicle and driving records, and voter registration information.
In addition over half the data brokers obtain additional publicly available information. And that information might come from sources including telephone directories, press reports, and information that individuals may for example post on social media.
Finally, eight of the data brokers purchase information from commercial sources, and those include retailers, catalogue companies, websites, financial service companies, and even other data brokers. And in our report, I don’t have a slide of this, but there is this graphic that shows the complicated web that the data brokers have of buying and selling information actually between and amongst each other.
In developing their products data brokers use not only the raw data they acquire from their sources, and by raw data I mean things like name, address, and age, but they also derive additional data. So for example a data broker might infer that an individual with a boating license has an interest in boating, or that someone that goes to zappos.com has an interest in buying shoes, which in my case would be true. Essentially they make assumptions about consumers and they create derived data or data elements from pieces of data that they have.
Data brokers will also take several raw and derived data elements about consumers to place consumers into segments. So they will combine data elements to create a list of consumers who have similar characteristics. For example, a data segment called Soccer Moms might include all women between certain ages with children, all of whom have purchased sporting goods in the last two years.
In our report in Appendix B there is a list of the elements and segments, and I encourage you all to take a look. But it’s important to note that several of these segments actually reflect health conditions and other health issues. For example one of the segments is called Allergy Sufferer.
Regarding storage, the report noted that data brokers actually store the data in a variety of different ways. So some of them actually store data in the form of individual consumer profiles. Other data brokers store data by listing events in a database. So for example Jane Doe opened a house, Jane Doe is deceased, a series of events.
And others maintain databases that correspond to the sources of the data. For example court records and real-estate transactions. In addition the report indicated that data brokers sometimes exclude or suppress data from their products in a variety of ways, depending on what product it is that they are providing and the purposes for those products.
So let’s talk a little bit about what those products are. The report indicated that data brokers offer products in three broad categories: marketing, risk mitigation, and people search. These products combine for a total of $426 million in annual revenue in 2012 for just the nine data brokers that were studied. So it’s a very big market.
I won’t go over all of the different products that were described in the report, but I will say that the marketing products included direct marketing products, online marketing products, and marketing analytics for a client for example to determine the impact of a particular advertising campaign.
The risk mitigation category is generally classified into two main types of products. The first is identity verification. So that’s determining that someone who is actually trying to open an account for example is who they say they are. And the second type is fraud detection, which is a product that helps clients identify or reduce fraud. So for example one product offered by a data broker that was surveyed indicated whether an email address had a history of fraudulent transactions associated with it.
And the final product is people search. And so this is exactly what it sounds like. It allows you to provide a few pieces of information about someone and then the product will actually provide a whole bunch of additional information. And that includes anything from age and date of birth to telephone numbers to gender to address history, educational information, relatives, employment history, marriage records, a whole host of information.
Ultimately the report concluded with a number of findings. First, data brokers collect consumer data from numerous sources, largely without consumers’ knowledge. A lot of this collection is occurring but consumers do not know it is happening, and then that information is shared in and amongst a variety of other entities, including other data brokers.
Two, the data broker industry is complex, with multiple layers of data brokers providing data to each other. This makes it virtually impossible for a consumer to determine how a data broker obtained his or her information, because a consumer would essentially have to retrace a path through a series of data brokers to trace it back.
Three, data brokers collect and store billions of data elements covering nearly every US consumer.
Four, data brokers combine and analyze data about consumers to make inferences about them, including potentially sensitive information. The report particularly focuses on categories including ethnicity and income levels and also health related topics and conditions. So in addition to what I mentioned before, expectant parent, diabetes interest, cholesterol focus, all of those are data segments.
And finally data brokers combine online and offline data to market to consumers online.
As to consumer choice, the report concluded that to the extent data brokers offer consumers choices about their data the choices are largely invisible and incomplete. And just to give you a flavor of the findings that are set forth in more detail in the report, with respect to marketing products most provided consumers with limited access to some but not all of the actual and derived data they had about them. However only two of the surveyed data brokers allowed consumers to correct their personal information.
Four of the five allowed consumers to opt out of the use of their personal information for marketing purposes, but it’s not clear really how consumers would exercise that choice since they would have to know about the existence of the data brokers, and then find the website, and then actually go through all of the steps to opt out.
Regarding risk mitigation products, of the four that provided those products only two provided consumers with some form of access to their information, and only one allows consumers the ability to correct wrong information.
And then finally for people search products consumers generally can access their information, but not all of the data brokers allow consumers the ability to opt out of the disclosure of their information.
And as to benefits and risks, the report concluded that consumers do benefit from many of the purposes for which data brokers collect information. These products help prevent fraud, they improve product offerings, and they tailor advertisements to consumers, which some consumers definitely do enjoy.
But at the same time many of the purposes for which data brokers collect and use data do pose risks to consumers. So for example if a consumer is denied the ability to conclude a transaction based on an error in a risk mitigation product, the consumer could be harmed without knowing why.
In addition, another example is while a certain data segment could help a company offer tailored advertising that would be beneficial to a consumer, if that same information was used by an insurance company to assume for example that the consumer engages in risky behavior such as the biker enthusiast example here, it could potentially be harmful. And finally, to echo what others here have said, storing data about consumers indefinitely does also raise security risks.
Based on these findings the report makes several legislative recommendations. For marketing products the report encourages the creation of legislation that would give consumers the right to access data at a reasonable level of detail, including inferences, and opt out of its use for marketing. For risk mitigation products the report encourages the creation of legislation that provides consumers with notice when a risk mitigation product denies the consumer the ability to complete a transaction, and access and correction of the information.
And finally, as to people search products, the report encourages the creation of legislation that provides consumers with access to information, the ability to suppress information, the disclosure of source of information, and disclosure of limitations of any opt out that’s provided.
Finally the report makes several best practice recommendations. First it continues to call on data brokers to adopt the principles contained in the privacy report, and specifically to incorporate privacy by design. The report encourages data brokers for example to collect only the data that they need, and to properly dispose of the data as it becomes less useful.
Second, the report calls on companies to implement better measures to refrain from collecting information from children and teens, particularly in marketing products. So this recommendation was made because the Commission found that while some of the data brokers actually do have a policy against collecting and using information about children or teens, some of them were merely relying on their sources to suppress this information but don’t take any additional steps to ensure that it’s properly suppressed.
And finally, the report calls on companies to take reasonable precautions to ensure that downstream users of the data don’t use it for eligibility determinations or for unlawful discriminatory purposes. And so what does that mean? Well for example while the data segment smoker in household could be used to market a new air filter, a downstream entity might also use that segment to suggest that the person is a poor credit risk, a poor insurance risk, or an unsuitable candidate for admission to a university.
So what are some of the reasonable precautions a data broker could take? Some of the data brokers are already contractually limiting the purposes for which their clients can use the data. But in addition, and some do do this, they could go further by auditing their clients to ascertain that it’s not being used for a contractually prohibited purpose.
Switching gears just briefly, I’m going to talk a little bit about our Big Data Report, which was published in January of this year. So this report draws on the FTC’s 2015 workshop on big data as well as the Commission Seminar in the spring of 2014 on alternative scoring products.
In contrast to the Data Broker Report, which is a little bit more of a policy report, the Big Data Report is intended to provide guidance for businesses as they use big data. And so it really takes a look at the question of after you’ve collected and analyzed the data how are you using it, and how can companies use big data to help consumers, and what steps can they take to avoid inadvertently harming consumers through big data analytics.
The report talks about benefits and risks of which several have already been mentioned here today, and I will just highlight that the benefits include providing healthcare tailored to individual patients’ characteristics, and providing specialized healthcare to underserved communities, as well as risks, and those include individuals mistakenly being denied opportunities based on the actions of others.
For example, if others with characteristics like you are less likely to pay credit card bills then you might not see those offers. The creation or reinforcement of existing disparities. Potentially exposing sensitive information. Assisting in the targeting of vulnerable consumers for fraud. Creating new justifications for exclusion. And weakening the effectiveness of consumer choice.
Among other mitigating factors the report suggests being aware of the laws that are in this area. And I will not go into those although the report does in depth. But the report also talks about research considerations, and I will very briefly touch on those.
So first consider whether your datasets are missing information from particular populations, and if they are take appropriate steps to recognize this problem. Second, review your data sets and algorithms to ensure that hidden biases are not having an unintended impact on certain populations. Third, remember that just because big data found a correlation it does not necessarily mean that that correlation is meaningful, and this is exactly what Michelle talked about earlier.
And finally, consider whether fairness and ethical considerations advise against using big data in certain circumstances. Overall I think the message is to realize that there are potential biases that can occur in every step of the process, whether it be gathering your data, determining what your algorithm is, trying to figure out what your question is, and trying to figure out what the results mean.
And then I will end with just what we have said before about de-identification, which again I know has already been mentioned by other panelists here. But we have a three step approach which would first recommend reasonable steps to de-identify data, including by keeping up with technological developments.
Two, having a public commitment not to re-identify the data. And three, having enforceable contracts requiring any third party to commit not to re-identify. And with those these remain vital tools to protect consumer privacy in the era of big data. So thank you again for having me here, and I look forward to the questions.
MS. KLOSS: Thank you very much. I think we’ll want to make sure we circulate these slides to the members of the committee. And we want the Big Data Report also, and the Data Brokers Report.
MS. BERNSTEIN: I can get it to you but it is certainly available on the website. I did circulate just to the subcommittee since I happen to be online, the new report that’s going to appear in the Santa Clara Law Review.
MS. KLOSS: We’re open for questions. I think we should hold our break until Jules and Michelle, until the questions for those two are out.
MS. MILAM: Would you paint me a picture of how it should work with a consumer? In talking about big data, what should the consumer know, what is the role of the notice, and what sort of technology or practices should be in place such that the data is useful and the consumer has the needed control and the right amount of transparency and site has accountability?
MR. POLONETSKY: Do you have a week, two weeks, or a month? I’ll just suggest that solving all privacy is going to be a giant challenge, and I don’t intend to try to solve all privacy. I think the challenge, let me rephrase your question, how about this: When do we not need to talk to the consumer at all because there’s no problem here? If we can agree with that piece, when is there no problem because it’s so well de-identified, do I need to report to you that I keep count of how many customers or how many patients in some aggregate way, there’s 1000 people who seek healthcare in the US, not an issue.
How far can we draw, well wait a second I’m now doing unique research, and it’s really well de-identified, but I’m doing research built off your data. Do we need to tell you at all? Today we often don’t, we say we use the data for research. There’s been some suggestion in recent thinking that even if we’ve done a pretty good job at de-identifying it you have some right to know or object or at least be aware that science is happening here even if there’s nothing that relates to you as an individual.
That’s worth debating and determining whether that’s feasible, am I ever going to give you any information that’s relevant, does it end up just being hey we do research in our policy because we can’t tell you because we don’t know, three years later we learn that this data set which is so well aggregated we don’t even know where it came from, but wow that’s some interesting learning about this particular population. So is there logic to that, or maybe in some places there are. It sounded like your question might be how about when we do have some reason to tell you —
MS. MILAM: Let me narrow my question. I’m really thinking of something specific. So, when there is an important reason for a consumer to care about where the data goes, are there reasons to tag the data, to have the consumer own the ultimate flow of the data through its full life cycle so that they manage it and guide it and control it, and how does that governance work?
MR. POLONETSKY: I’ll just say real briefly there are probably some places, yes, if I have a digital health assistant and I’m pulling my medical records and I’m making these discrete decisions that I might want it for this study and not that study because I’m actually making decisions by clicking on some version of HealthKit that I’m okay with that, I’m okay with that, you can see perhaps there being some way of making those nuanced decisions. I think in other areas it may not be feasible or practical to do so. But it sounds like Michelle had a —
MS. DE MOOY: I would say, I absolutely think that that’s important to give individuals as much control as possible. I think there’s a lot of cynicism around whether or not it’s possible, and so I think it is, and the reason I think that is because of innovations like HealthKit and other sort of data repositories that are privacy protective, I think the government should be more involved in enabling those, maybe not being those per se, but enabling sort of citizen commons where there is an ability to assemble and control completely their data, especially when it comes to health.
I think the considerations that you have to look at are the data types, and this sort of goes with Jules’s idea of this scale, the risk and the benefit scale, which I think is totally fair. I mean I think that is an important part of when you’re looking at how to apply regulations. But you have to look at the data types, if it’s genomic data it’s an entirely different scenario in terms of identifiability, and that’s just the reality of it.
The data uses and I think the data sharing and the data uses, I think those are the three buckets where your question has the most resonance in terms of how much control somebody has, how much protection is available. Notice I think is important in the sense that I think it’s important to give people information, especially when it comes to research. I think most people are happy to contribute their data to research, and so it’s useful for them to understand that that’s what’s happening.
But when it comes in the commercial context I think it’s useless honestly. I think they are really just liability policies, and typically I’m also one of those people who just sort of clicks yes, and I think all of us do that, and I think that it’s time to think of a new regime for how that can work, and I think it starts with people actually controlling the data right from the beginning.
And I think in health, the healthcare system is suited in a sense for this. Because of EHRs there are ways in which these sort of repositories could happen, and I think policy should as I said in my testimony think about different models for how these could happen, and I think part of it are those three buckets, looking at the types and the sharing and the uses.
MS. HAN: I will just say quickly I also think it is important not to give up on the idea of effective notice and consent. I think perhaps it’s not the only tool in the toolbox but it’s certainly a significant one, and I think it’s important for consumers, particularly when it comes to their health information, because it’s very context specific. I mean it’s generally context specific even outside of health, but certainly consumers are very comfortable sharing in a certain context, and they’re much less comfortable in sharing in other contexts. So a system that can account for that I think is important.
MR. POLONETSKY: I’d like to finish with breaking that to some extent if we’re using de-identified data, because once I’ve got the ability to revoke consent and tie it back to that individual we’ve undermined it. So to the extent we’re thinking de-identified data I’d argue likely not. In other areas there may be logic and ability to do so.
MS. DE MOOY: I think in the context of R&D this is especially important, and part of the reason we did a report with Fitbit where we looked at their internal R&D process is because this is where most of the data is in aggregate, they can tie it to individuals if they like but it’s really not necessary to do that most of the time.
And so the questions are what do we expect from commercial entities that have this health information, and one of the frames that we tried to put on a company like Fitbit is you’re a data steward, you’re not sort of this repository, you’re not this silo, this data vacuum, you’re kind of a steward of this kind of health information, and therefore the ethical considerations that come into play, the privacy considerations that come into play are their responsibility. Also the questions of bias, the questions of inclusion and diversity are their responsibility. And to their credit they were willing to take that on. I think that can be developed much more in policy as Jules talked about.
DR. PREDITH: I think your question goes right to where PCAST thought about this idea of privacy profiles so that you have an intermediary who is working with the user to help step through every step of the way where is this data going and how is it being used. And so one of the things that their report looks at is use technology to help solve this problem that is bringing this problem up in the first place. How can you actually ensure that you’re continuing to say all of these abilities that you have from big data and from analytics are actually part of the solution, too.
MS. KLOSS: We had a lot of discussion this morning on genomic data. Could you address how your perspectives apply to that particular class or type of data?
MS. DE MOOY: I think the primary point that I wanted to make was that the probabilities have changed to sort of possibilities. And the reason is fairly technical, but basically it comes down to the risk ratio changing when it comes to genomic data, this intrinsically personal and sensitive data, and so the idea of linkage I think versus identifiability is much more resonant in that context, and I think that requires a different set of sort of technical responses and policy responses. I think the issue of genetic data, two things. Sort of the role of it in government databases I really think is an issue that many many agencies have avoided.
I think it’s crucial, especially when you’re talking about initiatives like precision medicine where you have a question of devices being the delivery mechanism of this information, and then the government is going to maybe have access, and if it’s not encrypted or if it is encrypted, these sorts of issues really hit home and really matter when it comes to genetic information, which of course doesn’t just implicate you but everyone in your family. And so I think the issue of restricting information from law enforcement would be really important in the genomic context, which I think policy can address but hasn’t really.
I think the understanding of genomic information becoming a part of our medical records needs to be addressed now. What does that mean in terms of consent for research and other uses? If you’ve come in just to get routine medical care, and that’s sort of implicitly you’ve agreed to let your genomic data be used for research, that seems unethical. So those sorts of questions are the ones I would highlight.
DR. PREDITH: I would say the PCAST report doesn’t specifically call out genomic data as something separate that they talk about, but I think they see it as kind of one of a large number of types of information that is going to influence how people react to other people. So it is important but there is a lot of information that’s going to be gleaned from big data and analytics, and so anything that’s relevant there would be relevant to other kinds of information as well.
MS. DE MOOY: The other issue I just wanted to briefly say, and I say this in my testimony, it’s really important, is that regulators have thus far interpreted federal health privacy law to permit providers to treat whole genomic sequences as de-identified information with no ethical oversight or security precautions, and this is even when it’s combined with health histories or demographics. That’s a problem. That seems like something that policy can address and should address now before it becomes a much bigger problem.
MS. KLOSS: Shall we take a break and then come back, would that work for you Cora and Ashley? So we’ll resume with Ashley and Cora but we’ll take a 15 minute break.
MS. KLOSS: We are in our discussion period with Michelle and Ashley and Cora. I think the subcommittee though we should just feel open to asking questions and continuing the discussion this morning. I think we’ll see how we’re doing and if time permits we’ll use a little time this afternoon to begin to frame some of the issues we heard today, and I think if we’re able to do that we’ll be grateful in getting a little jumpstart toward tomorrow.
Before we pose questions to our panelists Linda is going to clarify OCR’s position on genomic data and its de-identification.
MS. SANCHES: I am Linda Sanches with the Office for Civil Rights. I did actually go down to confirm given the statement earlier. We actually have not put out any guidance as to the status of genomic data. It is a piece of health information, and I think under the Safe Harbor Provision one could definitely interpret it as being an identifiable piece of information that would need to be stripped, but we haven’t put out guidance, and we have publicly stated that we’ll be putting out a request for information to the public so that we can examine the issue more carefully. So we do plan to issue guidance, but before we do that we plan to get some more information.
MS. KLOSS: I just want to make sure that the record was clear on this. Questions for our panelists?
MR. COUSSOULE: I have a question more in regards to kind of the notice and consent question. The idea of effective notice and consent, I don’t think anybody would disagree with that conceptually. The question is how do you balance off the practical reality of trying to do that with a large dataset without introducing more bias and challenge in there with an individual’s kind of right and obligation not to have their data shared from a policy perspective, how would you frame that up?
MS. HAN: It is a very hard issue. We hear from businesses all the time how difficult it is. And I think it’s a combination of things. You’ve got small screens, you’ve got sensitive information, you have got complicated things that you’re trying to express. I will say that there is a lot of work being done in terms of sort of innovating in terms of how you communicate. There are a couple of methods that we sort of set forth in our Internet of Things report. So how to communicate with consumers and provide notice and obtain consent when you’re talking about connected devices.
And that’s not exactly analogous to what you’re talking about, but for example you could see potentially a dashboard approach which might allow consumers to be able to sort of see all of the consents that they’ve given in a particular context and adjust them as necessary. I think one of the other things that I’ve heard bandied about is this concept of dynamic consent. And I think that sounds great but I admit it seems like there are definitely challenges to implementing it.
DR. PREDITH: I would just say going along with the technology angle on this just ensure that technologies can be developed with all the metadata associated with them, you have privacy profiles that are put into place so that some decisions can be made maybe not in an automatic sort of way but in a way that the person has pre-understood what those issues are, and try to create a language around what those choices and tradeoffs are. So you want to create a process that both uses the technology itself to help automate the process of making decisions around the privacy rights.
MR. COUSSOULE: So, if I envision this as a consumer, I could conceivably be getting questions asked from innumerable different places where now my information potentially is going to be utilized.
MS. HAN: So, I think one of the ways of addressing that is for companies or whoever it is who are holding the data to think about what the appropriate contexts are, are you doing something with the information that the consumer would not expect you to be doing, are you collecting something that’s particularly sensitive such that the use of that information would warrant a particular kind of notice and the obtaining of consent.
I think those are the kinds of questions to ask, because I think that’s right, otherwise you end up in a world where people endlessly get notifications and then they just start tuning out. And to some extent I think people think like that’s already happened in a certain context.
But in our privacy report and in other places we’ve said when you’re talking about something like product fulfillment or something like fraud or something where consumers just understand that you will need their information to be collected in order to have something happen, in those situations it may not make sense to actually get consent, and sort of long complicated notices may not be necessary.
But in a situation where you might be sharing that information in a way that consumers wouldn’t expect, so you might collect it for one purpose but then share it for a marketing purpose, you wouldn’t want consumers to be surprised. And so in that situation it might well make sense to have a more robust process for getting consent.
DR. RIPPEN: So, again thank you so much for the thoughtful presentations and discussions. It’s interesting because a lot of times you find when there are new sectors that get involved in healthcare there tends to be a kind of a difference in kind of the perspectives of what is appropriate or not appropriate or fair use or not.
And I remember not too long ago during the E-Health craze where we had a major influx in the private sector that were not familiar with the health field, about what was appropriate or not appropriate. And the question as far as do you regulate, how do you manage something like that.
And actually the big question of ethics and what is the ethical construct, a code of conduct that actually served as an opportunity to actually provide some sort of training or insight for people who are not necessarily used to kind of that framework in how they maybe pursue business because it’s a different approach.
And so that seemed to have worked, and I know we worked with FTC, it was the EHealth code of conduct, in addition to then what’s allowable or not, false advertising, that sort of thing. And then a third arm which was — So we had an ethical construct, the enforcement components, and the third was education, what do consumers look for or not. And now we’re in this new wild and positive opportunity with potential downsides of big data where now we have different fields that have different uses of data now being able to actually combine it.
And so it might also be good, and I know it was part of the PCAST report on the whole concept of what are the ethical principles, how does one actually make decisions and what are things to consider. Because when you start talking about regulation that’s kind of the tip of the pyramid where the ethical is the base and it’s broader because you don’t have to do a thousand permutations of what-if. So again, do you think that that is a possible approach to build upon?
DR. PREDITH: You are talking about the idea of using a pyramid model, you’re saying to build upon what?
DR. RIPPEN: The ethical principle with a component of education and then laws as they are needed for mitigating risk and getting the benefits of big data and de-identification but mitigating some of the risks.
DR. PREDITH: Just the PCAST recommendations directly, they talk about how you need to have a workforce that understands all of the privacy concerns here, and that they need to be working hand in hand with whoever, whichever particular sector they’re working in. So those individuals would be part of that ethical discussion in helping to guide the appropriate use of the data or doing the right kind of analysis or analytics on that data. That’s kind of incorporated into the workforce question for them.
DR. MAYS: I want to talk about the data brokers report. And it kind of hits a little bit of what you’re saying. I sat there and I realized how many of these things did I fill out, and where is the data. Is there not any guidance about what should be put together, like guidance that says that this is kind of crossing a line, or that when we look at like a warranty or something it’s bigger warnings to us that this could be used for other things.
Because I guess what I’m concerned about is I understand we’re in a marketplace where the selling of data is a business, but it seems like there’s some kind of business ethics or some kind of need for a warning in some kind of way that’s not in four point type or something that allows the consumer to have a better sense of all the ways that this data might go.
Because when I fill out a warranty I really thought, naïve me, in a company, but it’s clear it’s being sold, but I don’t know like the combination of things that get put together. It just seems like there should be some kind of ethical guidelines about how far to take this. What’s your thinking about that?
MS. HAN: I think you hit the nail on the head when you talk about the very concern of transparency that I think was at sort of the heart of the Data Broker Report. Because I think you are exactly right, I think the collection happens, and the report found from its survey that by and large consumers don’t know that it’s happening. And so the recommendations go to that, they sort of focus on how to promote transparency and how to provide accountability. But still I think that does not exist today, and so a lot of this collection still happens without consumers’ knowledge.
DR. MAYS: Is it permissible in this area to start talking about business ethics?
MS. HAN: I think it is. I think it’s certainly permissible. Michelle made a good point when she talked about how can we think about entities as data stewards. I think there’s a question though as to what incentives are there for businesses to start doing that.
DR. PREDITH: That also goes to the first PCAST recommendation around creating policies around use. So if part of the concern is around use, the data is being used for advertising or for marketing or something like that, that could be incorporated specifically into what it is you are or are not comfortable with your data going into. And I wonder if it’s possible to specify what other uses are that would or would not be appropriate.
MS. KLOSS: One of the aspects of both of the reports, the Data Broker and the PCAST, was discussion about kind of life cycle management of the data and how we need to step that up, or our data brokers need to, or any holder of data needs to think about not only getting it and storing it and using it, but at what point do you destroy it or archive it. Have you issued any further guidance on that, or what do you see as the next steps for that? Because I think we heard this morning that the longer you keep it the more it’s likely to be used for yet another purpose that perhaps wasn’t intended.
DR. PREDITH: That particular question is why PCAST was very clear in what they said: in a policy making process, you just have to consider that data is permanent, that ultimately it’s distributed too widely, copied too widely, you don’t know where it’s going or what it’s been contributing to.
At this point the way that we control and use data, if there were advances in technology that allowed you to track and do this kind of auditing process that they talked about, maybe it would be different. But that data is still going to exist out there. And so they just weigh down again on the use question, like you have to be very clear about what you do and do not want that to be used for.
And we were talking during the break also, there is the idea that so maybe you just don’t do this, maybe you don’t bring the data together, maybe you don’t do the analytics on it, and maybe you just kind of avoid this problem altogether, but then you lose out on all the possibilities that there are.
MS. HAN: I would say that one of the things that we talk about generally with respect to data security and also the internet of things is reasonable data minimization. I think what you’re saying with respect to don’t collect more than you need, and when it ceases to have sort of a business use then get rid of it. And so for example in the data broker context it may be for example that an old address is not useful for marketing, but for a different type of product, like a fraud product, maybe it is useful, and I think the report talks about that.
But I think in addition, thinking about the lifecycle is part of building in privacy by design, it’s thinking about sort of your product from development through implementation through the end, through all the different things that might happen to your business. And so that’s another I think important consideration to the extent that there kind of can be protections that are built in there.
MS. KLOSS: Any other question?
MS. BERNSTEIN: I just want to ask before you guys leave, are there upcoming project reports, hearings, whatever that you guys are doing in your work that we should know about, that staff or members here would benefit from, are you moving onto other topics or are things in the offing in the next months or year that you are at liberty to tell us about?
MS. HAN: I can’t think of anything off the top of my head, but let me go back and talk to people in the office to make sure I’m not overlooking something.
DR. PREDITH: There are two PCAST reports that are ongoing right now that might be of interest here. One of them is on forensic sciences, so how do you use genomic data, how do you use other kinds of forensic evidence to make determinations about what is happening in criminal proceedings, and then there’s another one on science and technology for water and how does the testing of water, what do you know about water, what do you know about the infrastructure of water and how is that affecting the health of people. And I could see that also is around kind of data, data collection, monitoring, things like that, so that may be of interest.
MS. HAN: It’s the importance of the need to focus on the data not in the here and now, but in the future. And to the extent that contractual provisions can be helpful that’s one tool, to the extent that contractual provisions and audits might provide additional protections I think those could be helpful tools.
DR. PREDITH: Right now obviously genomic data is really important and it’s kind of the thing that’s in front of everyone’s faces right now, but it could be that 30 years in the future you find out that actually people who lived at that past old address plus some other factor plus some other factor is actually the kind of weird combination of factors nobody ever would have thought about to be discriminating against people on.
The PCAST report intentionally had a very forward looking view, and was hoping that, don’t think about just what is the ability of analytics right now but where is the state of technology going to be in 30 years or 50 years or whatever it is and try to think about that also.
DR. PHILLIPS: So, there is increasing pressure to collect social determinant data about patients, and there’s an increasing acceptance of lowering the veil about behavioral health data about patients. There’s even more potential sensitive data coming into medical data that we collect routinely. Does that, in your forward looking view, does that come up? Because those have a lot of relevance for discrimination, for you name it. Has there been thought about that?
DR. PREDITH: The PCAST Report doesn’t specifically draw that out but it just goes back to the idea about policies around use. So to be very clear about how and when and what those data are used for, and to be very clear about that.
MS. KLOSS: It underscores the discussion we had earlier, and by some of this panel, that increasingly a little more interagency view is helpful because we’re not just staying in our neat buckets anymore, we’re spilling out of our buckets with this.
MS. HAN: I think it is already happening, and I think it will increasingly happen, and to the extent that robust de-identification can work and still allow the data to be useful, I think it also can be an important protection.
DR. PREDITH: PCAST does not believe that de-identification ultimately works; you’re just going to be able to identify people, ultimately there’s just going to be too much information.
DR. RIPPEN: It is an interesting thing to see with regards to health information, which again can be very sensitive information, the movement for de-identification for public good and making it available for others to leverage and to make discoveries and improve our health.
Has there been a lot of discussions about other sources of data also being the same? Because if you think about healthcare there’s private sector healthcare, there’s public, but it’s all lumped together, and if you think about some of the other data sources, be they mobile, be it the social determinants of health, the types that we know are extremely important to health outcomes, and actually more so than healthcare delivery itself, do you think there will be a trend to making all data available, even from the private sector, for public good?
MS. HAN: There are numerous additional sources of data, there are websites, there’s wearables, there’s all sorts of things. I think that there should be privacy and security protections for that data. And there could be a variety of forms, and there probably will need to be a variety of forms. So I don’t really think, from what I’ve seen so far, that there’s a movement towards just making it all kind of out there and public. Because I do feel like it is sensitive data for consumers.
DR. PREDITH: Certainly the PCAST report was not specific to healthcare at all, it was talking about data everywhere. Certainly companies make money off of data, we need a private sector and we need an economy.
DR. MAYS: I just want to pick up on the comment about the social determinants data. I was just thinking that when we start thinking about as people are sicker they have more encounters, and we need better and more data. And what’s going to happen over time is there will be a group of people, probably predominantly people who are poor and people who are racial and ethnic minorities who are going to have incredible datasets.
If we really do what we say that we’re going to do we need to monitor transportation, we need to think about if they can’t get in having wearables that we get data from, using wireless that we monitor them from because they can’t get in.
And it really is time to kind of think about we need that data, but should it just be like everything else, because it’s a group of people for whom feelings of discrimination, being monitored, et cetera, often will cause them not to come in. For them to hear oh your data is just out there and it’s wonderful may for example end up creating more of a problem about telling people things and wanting to participate in healthcare the way we want to.
I mean it would be very forward thinking for us to kind of think a little bit about what it means in terms of making sure that as healthcare gets more open and technology is more available that we think about again protections, privacy, security, and ways in which some of what we’re actually prompting maybe doesn’t happen as easily without people being able to consent to it. So I’m just wondering what you think about exploring those issues more and whether you think that we should think a little bit more about vulnerable populations in terms of this kind of access of data.
MS. HAN: One of the things that we mentioned in our Big Data Report is that when you’re doing analyses you need to think about if your population sort of includes everyone that you need it to include. And so there may be populations that are either over-represented or under-represented. In the development of your analysis you should be considering those things.
DR. PREDITH: I will add to that and say that the PCAST Report doesn’t specifically call that out, but they talk about how this data can be used to discriminate against people on a number of different factors beyond what is even done now.
So even as you get down to precision medicine and everything you know, you can go down to like N equals 1 discrimination because this complicated set of factors that is unique to that person could be used for discrimination. So certainly keeping everybody involved and in the conversations around the use of this data and where it’s going is going to be critically important to making sure the right policies are put in place.
MR. COUSSOULE: I think it even comes in to, if you think about the treatment, payment, and operations model where the use is allowed, that’s even changing, because the idea of what is treatment. You didn’t use to think of that home monitoring as a treatment kind of issue. Even as we move forward, the technology changing and those kind of things, the way it was originally structured, it’s not the same blocks anymore.
MS. KLOSS: So, we will continue among ourselves as a subcommittee, and certainly any of our morning panelists can join us. But I think it would be useful just to do some reflection on what we heard today. I’ll give you five minutes to go through your thinking and pull out two or three new thoughts, and let’s see if we can’t go around and start getting some of those out, because they’ll help us tomorrow as we frame the issues after lunch tomorrow.
MS. KLOSS: Let’s get started.
DR. RIPPEN: I think the question of intent, it was brought up recently as it relates to the collection and use of de-identified information, and that also includes the question of re-identification, is one.
PARTICIPANT: How does intent relate to the re-identification?
DR. RIPPEN: Intent of use. If you have de-identified information, what is your intent to use it? So is the intent to add other data because you want to do some great intervention to, say, improve health, or is it because you want to re-identify? And then it goes to intent. The other one is really kind of the changing nature of de-identification tools and kind of the need for expertise and dissemination.
MS. HINE: So, Helga and I were talking about how to frame this, and some things to keep in mind is that this is such a huge field and we saw just today that the FTC is already very much involved, the Office of Science and Technology Policy at the White House involved. So what is it that the committee thinks someone doesn’t have an eye on? Where can a recommendation go to HHS that will actually be actionable and have impact, and it’s not already being dealt with? So to try to help put some banks on this huge body of water because there’s just so much.
MS. KLOSS: And we will get to that narrowing but I think it is important to see what we learned today and then to reflect on the questions we pulled together and say what haven’t we tapped or not tapped enough. So I completely agree.
DR. SUAREZ: I think to me there were a number of important messages. One, and to me perhaps the most important one, was the level of divergence in terms of the volume, in terms of how big this is a problem. It seems like on the one hand we have some evidence that this is relatively limited in terms of how much and how many cases have been documented.
But then on the other hand we have evidence also that shows that this is a real problem and a real situation, and that a lot of it might be of situations that are not well documented and reported, but clearly that there is re-identification happening, there is re-identification happening for many different purposes that usually when it comes to, we frame a number of things that we said around research here.
Usually within the research community or for treatment or payment related functions that’s not necessarily the big issue, it has to do with other elements in the industry and other factors in industry, and so I think it would be helpful to I guess further understand where is it that the problem really exists to be able to provide some of the advisement to the secretary.
But generally I think I mentioned earlier I see two policy areas where we can provide advisement and as I was hearing all the testimony I was helping myself take some notes about each of those areas. So one area is really on the place where the data is disclosed, who would disclose the data and really identifying some possible additional policy considerations around that, reinforcing the mechanisms around protecting the data.
So that’s one area, not just simply de-identification and better methods and better guidance and better tools, but also better ways of the entity that is disclosing the data controlling it. On the receiving end also having greater expectations about the controls on that end user around ability to re-identify and restrictions on that and that kind of thing. So I think we have two spaces, at least two, where we can consider policy recommendations.
I was very enlightened by the fact that there’s several different agencies working in this space, and I think one other area that we should think about making recommendations is around the increased coordination across agencies that have some degree or level of jurisdiction over the re-identification of personal information, personal health information, protected health information, et cetera. I’ll leave you with those two.
DR. PHILLIPS: I think a lot of what I took away has been said already. I went back to Dr. Garfinkel’s presentation, and I liked the fact that he emphasized the robustness of de-identification: adding noise, swapping, as I think Dr. Malin said, hiding in plain sight, or creating fully synthetic data. Keeping up with the technology about how you make it more robust is important.
At the same time, I was thinking with the PCAST notion that policies and regulations should be stated in terms of intended outcomes. So I think you have to balance both of those, how do you make it robust enough but also how do you clearly say what the intended outcomes are and the expectations, because that flows into the kinds of ways that you’re going to disclose them.
So also, Ira’s paper was very helpful for me in helping understand the two sides of this risk concern, and that using the statistical disclosure limitations and figuring out is there an internal or even an external group who should have direct access, and do you protect that with IRB and strong either legal or contractual language, do you have a dissemination model and a query based model, both of which are kind of voltage drops in the quantity of data or the identification capacity of the data, or do we even move to the extreme of moving closer to the law of data security that FTC has employed? All of those are lessons for me in a way that I think we can shift this policy and update it meaningfully.
DR. EVANS: The presentations were so excellent and I also was struck again by the robustness of safe harbor de-identification. It’s useful, it’s just in theory breachable. But more broadly is it going to be tenable to continue to look at the covered entity as the person who bears legal responsibility to protect the privacy of data? How re-identifiable something is is in part a function of the attributes of the data and how it’s held by the covered entity. But I was struck this morning that it’s also very much a function of the global data ecosystem in which it exists, or I think Brad Malin wanted to use the global data sphere.
Is it going to be tenable to keep making the covered entity the focus of regulation when in some degree they’re powerless to affect how recognizable data will be because they exist elsewhere? It’s sort of like obscurity, obscure people don’t get recognized when they walk down the street, famous people do, and it’s because famous people’s faces are all over the place. Individuals’ data is now all over the place, and the responsibility for the privacy risk is in part theirs and in part the ubiquity of their data.
It just resonates with me that at some point the approach taken in HIPAA that says you, the covered entity, are responsible to protect privacy, may cease to be tenable. And I don’t think we can fix that problem, but I think, and this resonates with what Rebecca said, I think we need to be very clear what are we trying to do here. I think it would mislead the public to say oh we’ve had the subcommittee and we make recommendations and this will fix everything.
I think we’re in a time where we need to start disclosing to the public that the covered entity can only do so much to protect you, and it’s going to depend on what you do, and that goes with these calls we heard in the afternoon, we need to be very transparent with people so that they know when they get that affinity card at the store they are selling out some of their own privacy and help build people’s individual responsibility. And I think henceforward we do a disservice to people if we suggest that HIPAA can fully protect you by itself. And I’m sorry I’m waxing so long but I found the meetings quite fascinating.
The thing about the comment that is going to be a whole process and it may require regular updating, I think this notion of a more process oriented regulatory framework may be the best we can do, and add a lot to it. And I will stop there and give others a chance. But I found this a very rich and thought provoking session.
MS. MILAM: I thought the discussion around the lack of a definition of risk was interesting, especially appropriate levels of risk. And the different views that were expressed. So I think for us to understand if a dataset is too identifiable we need something of a gold standard as to what it should be.
Because we know that, we heard that most people would view some risk as acceptable, that no risk will prevent data use. And so how much risk is okay. So if we can get that to some sort of percentage then I think we can better know from a data user standpoint or a covered entity custodian standpoint, what SDL techniques we ought to apply on top of safe harbor. So how do we know when we’re done type of thing.
I think we also had an interesting discussion around notice, how to make it meaningful. Or first of all when should it be used, when does it really serve no purpose, and how can it be meaningful and understandable by the average consumer, and I’m wondering if we might benefit from looking at other models that work well in informing consumers.
MS. SANCHES: I just want to clarify that when a covered entity releases a de-identified dataset that is de-identified consistent with the standard they are no longer responsible. So if it is re-identified it is not held back to them if they have actually followed the requirements of the rule.
DR. EVANS: We understand there is no enforcement action at that point, though that’s a good clarification.
DR. RIPPEN: If you are the guardian or the entity who ultimately is charged with protecting the privacy of information about the people they serve, if it’s not sufficient and people do combine it there’s no legal requirement but there are actually other ones that actually may be held more dearly to healthcare systems based on kind of their mission and their goals.
So I agree there’s no legal or HIPAA requirement, but again it goes back to what would the purposes of releasing de-identified data be, and then building on some of the other comments that people had, which was a multi-tiered approach to mitigating. So again and whether HIPAA is sufficient any longer.
MS. KLOSS: And the question that got raised this morning is that okay, once it’s de-identified it no longer falls under sort of anything. That was one of my takeaways. We’ve done our job well, but we throw it out into the world and hope everybody else treats that data as we would as a steward or I don’t know.
So that led me to kind of my number one, which is this issue of we’re sort of at this point of public policy collision between privacy and use, and that one way that we might consider is not necessarily selecting a new and better mousetrap for de-identification, but strengthening the process environment in which de-identification exists not only in the originator, but also the receiver and any downstream users. That was my aha for the day.
MR. COUSSOULE: As I said earlier today, I think kind of pointing the finger at the distributor of the data is very similar to what has happened over the last however many years in the general security context of information disclosure. You have a security breach and the blame is all placed on the person who had the data as opposed to who perpetrated the bad, the wrong.
Now I’m not trying to say there’s no accountability there, but I think there does need to be a balance between who has the data and took due care to do that without getting into legal terms, I’m not an attorney, versus who actually did the thing that was negative and is there a downside and a risk, and that leads into another one for me which was I guess I hadn’t thought through this well enough, and maybe still haven’t, but there are very different rules applied to the same data in different context, and that’s a real challenge.
And if we’re talking about de-identification as it pertains to HIPAA, all the HIPAA guidelines are pretty clear, but once you leave the HIPAA realm they’re not anymore, or they may be the same and they may be very different.
That’s a tough one for me to reconcile, depending on where I’m sitting or what I’m doing I have different sets of rules, and frankly different legal risk. So that’s another one and then the other, I guess the last piece for me for right now is more about I think and it’s in one of the reports that we’re looking at, the focus really needs to be on what the actual and targeted uses and intent is as opposed to how do we collect, how do we gather, how do we do that.
And I think really from a policy perspective getting the focus pushed onto the intent side I think is a really big deal. And so I think we need to be thinking about the outcome side of that and what the objective is on kind of the back end of that as opposed to here’s what you do on the front end.
And I think the front end, because the technology is going to change a lot, the rules will change, the volume of data and the sources of data are growing every day differently. And focusing on that I think would be perpetually behind the eight ball as opposed to thinking enough, a little bit more maybe about the outcome and the usage side. So that’s my two cents worth today.
MS. KLOSS: Thank you. Vickie.
DR. MAYS: And I think we need to be in the space of differentiating data that’s the research side and data that’s kind of marketing commercial side. Because I think where there’s not a lot of guidance, nor a backdrop of protection, is when we start talking about kind of the marketing and commercial side.
And it’s not talking about it in a way which is to restrict it, but it’s to talk about it in the way in which it’s about expectation of a data stewardship or approach to it, of helping people to understand both consumers as well as those who are bundling this data up kind of how it can backfire to some extent in terms of people being upset and what have you.
And then there was one recommendation that I thought we should be in the space of, and that was Ashley talking earlier. And that is I was really struck by the notion of how much science there is about these issues, and it’s not moving to the practice.
I think we could be in the space of talking about a kind of best practices, best approaches, even one of the recommendations that I thought about is if you look at who is supposed to do translation, it’s the CTSA, the Clinical and Translational Science Award groups. And even having them pick up part of this agenda would I think be helpful. But I really think we could do something about this translation issue, that that would be a good space for us.
DR. RIPPEN: I think something that was fairly powerful though I think that people had different views of it that was presented today was, and again it goes back to intent, is the question of restrictions against re-identification, so again if one of the intents is not to allow it then that’s a clear highlight. The other is that we really shouldn’t forget the risks of mis-identification, because I think in many ways that can have opportunities for more risk, more negative sequelae than other types of re-identification.
And then I guess the final is to reinforce what was stated before about this notion of thinking across a government as it relates to information that might be relevant to its area of concentration, because as we all know and we’re all probably very active in this whole question of combining health information with other types of information so that we can more effectively address challenges in population health, makes even the kind of data that we’re collecting either in electronic health records or as data marts that we integrate more and more sensitive, and again making sure that we also abide by transparency to the people that we serve too as we start being collectors of big data also. So just kind of flesh it out a little bit.
DR. PHILLIPS: The more I thought about it today, Dr. Malin’s suggestion of a bill about restrictions on re-identification but with carve outs has me very anxious. When you create civil or criminal penalties the fear of that grows so large that you spend half a decade trying to sort out what is allowed and what’s not. So I’m just very cautious about going too far in that direction.
MS. BERNSTEIN: The proposal of the bill by Bob Gellman that he suggested was not to directly regulate but to support, as I understand the bill, which I haven’t looked at in quite a while, to support contractual obligations and make it legal to force people to have contractual ways to prevent people from re-identification and so forth, and not to directly regulate like we would in some other cases. It’s not a direct regulation, that’s all I want to say about it.
DR. EVANS: I very much support what was just said. Re-identifying data is not in itself a wrong, just as recognizing someone’s face and putting a face to a name is not a wrong. It’s what’s done with it afterward. And there are just so many situations where it is very desirable to re-identify if you have detected a really dangerous incidental finding while working with de-identified data and want to notify the person that they need to have follow-up care, it might save someone’s life. So I hope we can de-escalate and get away from this notion that re-identification is wrong in itself, it isn’t, it is what’s done afterward.
MS. BERNSTEIN: Do you want to know what the staff thought?
MS. KLOSS: Sure. Then we are going to go to public comment.
MS. BERNSTEIN: So, two things that struck me. One is I think, now I’m trying to remember who talked about, maybe it was Jules, the expanding concepts of research. Like we used to have this idea that research was a formal thing that happened in academic institutions, and he said in civil society or with agency funded grants, but research is maybe much more broad than that, and we should sort of have an expanded view of how we look at research. And I say that out loud with some trepidation while there is still the Common Rule floating about.
The other thing was things that we can actually do something about, which is we can use the convening force of either this committee or the department’s role to get the pragmatists and the formalists together in a room. Like some of them were here today, and maybe we can hear from a couple of the formalists tomorrow, but not on the same day in the same room.
And there are some spaces where those people get together, but we pointed out that the pragmatists are more like computer scientists, and the formalists are more social scientists, and they work at different agencies, and they don’t necessarily go to the same professional conferences and meetings.
There is one actually coming up next week, the Privacy Law Scholars Conference where some of those do get together, but it’s something we could actually maybe do something about, which is convene that kind of a group of people more even than we’ve done, than Rachel has done a fantastic job today and tomorrow, a more direct way of having those people talk and figure out what tools that they have that they can offer the government.
So then it supports what Walter was saying about multiple agencies working in this area. NIST is actually going to convene something related to this on June 29th. So a government de-identification stakeholders meeting that’s hosted by NIST on June 29. So they’re going to convene those federal agencies that Walter talked about, multiple ones of us kind of working in this area and trying to get some yeast in that process.
MS. SANCHES: When I met Simson and I asked him, “Oh, so what are some other de-identification frameworks, what else can we look at?” he said that actually this is it, the HIPAA de-identification standard is the only one. So any help you can give us on how it might want to look, given technology changes, is always helpful.
DR. GARFINKEL: So, there is an organization called HITRUST, which you may be familiar with from the HIPAA Privacy and Security Rules, and HITRUST has recently released a de-identification framework. It’s particularly relevant because it has a training component, an auditing component, a continuous monitoring component, and a certification test.
PARTICIPANT: So it’s a product.
DR. GARFINKEL: HITRUST is both a nonprofit organization and a for-profit organization, as I’ve been told. And so they have a work product that the nonprofit organization has produced, and then they have a licensed trademark that the for-profit organization licenses out. But I’m not here as a representative of HITRUST. So that is a framework.
And there is of course FERPA, the Family Educational Rights and Privacy Act, which has a de-identification standard. But it’s not as well developed as the HIPAA standard is. And then there are several other laws that mention de-identification, but they don’t explain what they mean by that.
MR. COUSSOULE: HITRUST has a framework around it, but it’s all based on the HIPAA guidance.
DR. GARFINKEL: I have been briefed on it.
MS. SEEGER: I think we should table the HITRUST certification framework for tomorrow, because Kim Grey’s testimony from IMS Health does touch on it; it’s the very last presentation of the day tomorrow.
MS. KLOSS: So I think it is time to go to public comment. Can we open the lines and see if there’s anyone on the phone, or is there anyone in the room who has a question or a comment? This is the NCVHS Privacy, Confidentiality, and Security Subcommittee Meeting on De-identification. Are there any questions from the public? Comments?
Hearing none, the meeting today stands adjourned. And we reconvene at 9:00 promptly. Thanks to everybody.
(Whereupon the meeting was adjourned at 4:55 p.m.)