[This Transcript is Unedited]
Department of Health and Human Services
Subcommittee on Privacy, Confidentiality & Security
National Committee on Vital and Health Statistics
“De-Identification and the Health Insurance Portability and Accountability Act (HIPAA)”
May 25, 2016
Hubert H. Humphrey Building
200 Independence Ave., SW
TABLE OF CONTENTS
- Opening Remarks – Linda Kloss, Chair
- Panel III: Approaches for De-Identifying and Re-Identifying Data
Speakers: Vitaly Shmatikov, Jacki Monson, Jeptha Curtis, Cavan Capps, Yaniv Erlich
- Panel IV: Model for Privacy-Preserving and Use of Private Information
Speakers: Micah Altman, Sheila Colclasure, Kim Gray
- Public Comment
- Subcommittee Discussion: Review Themes, Identify Potential Recommendations and Additional Information Needs
P R O C E E D I N G S (9:04 a.m.)
MS. KLOSS: Good morning, and welcome to the second day of our hearing on de-identification and HIPAA. This hearing is convened by the Subcommittee on Privacy, Confidentiality and Security of the National Committee on Vital and Health Statistics, and we welcome all who are here in the room and those who are joining by phone.
We have a little formal business to do first as part of the advisory committee protocol. We will introduce members of the Subcommittee for the record and then we will commence our business.
My name is Linda Kloss. I am a member of the Full Committee. I chair the subcommittee, and I am also a member of the Standards Subcommittee and I have no conflicts of interest with the topics of today’s business.
DR. RIPPEN: Helga Rippen, Health Scientist, South Carolina. Clemson University and University of South Carolina. I am on the Full Committee and am also a member of the subcommittee, also on the Population Health Subcommittee and the Data Working Group. I have no conflicts.
DR. PHILLIPS: Bob Phillips, American Board of Family Medicine. I’m on the Full Committee, this Subcommittee and the Population Health Subcommittee and no conflicts to report.
MS. MILAM: Good morning, Sallie Milam with the West Virginia Health Care Authority. I’m a member of the Full Committee and the subcommittee and I have no conflicts.
MR. COUSSOULE: Nick Coussoule with Blue Cross/BlueShield of Tennessee, member of the Full Committee, this committee, and the Standards Subcommittee, and I have no conflicts.
DR. MAYS: Good morning. Vickie Mays, University of California, Los Angeles. I’m a member of the Full Committee, Pop and this, and I chair the Workgroup on Data Access and Use. I have no conflicts.
MS. KLOSS: Do we have any members of the Committee on the phone?
DR. SUAREZ: Yes, good morning. This is Walter Suarez. I am with Kaiser Permanente. I am the Chair of the National Committee on Vital and Health Statistics, a member of the subcommittees and workgroup, and I don’t have any conflicts of interest.
MS. KLOSS: Thank you, Walter. Any other members of the committee on the phone?
We do have Barbara Evans in the room with us but she just stepped out so we will have her introduce herself in a moment. Could we have the staff introduce themselves?
MS. HINES: Good morning. This is Rebecca Hines. I am with the CDC National Center for Health Statistics and I am the Executive Secretary of the national committee.
MS. SEEGER: Good morning, I am Rachel Seeger. I am with the Assistant Secretary for Planning and Evaluation and I’m the lead staff to the subcommittee.
MS. SANCHEZ: I am Linda Sanchez. I’m with the Office for Civil Rights and I am staff to the Committee.
MS. KLOSS: We will have our panelists introduce themselves when they speak.
(Introductions around the room)
MS. KLOSS: We had an outstanding day yesterday, and we know today is going to be equally on point. Our goal, as you know, with this hearing is to understand HIPAA’s de-identification requirements in light of evolving practices, and we picked up a lot of new concepts yesterday, including the pragmatists and the formalists, and just a full range of issues. There’s really a lot going on in the learning space.
Today, we are going to really focus on approaches and models. We are relying on our panelists today to help us bring the broad learning from yesterday back to enable us as a subcommittee to consider what kind of recommendations at this juncture we might prepare and propose to the Full Committee and then perhaps to the Secretary. We are a convening committee but we have the authority to make recommendations directly to the Secretary, and we are really looking at you all to help us guide what those should be at this moment in time that will be forward-looking.
MS. KLOSS: With no further ado, we will commence Panel III, which we have titled Approaches for De-Identifying, but we know these topics don’t fit in neat categories, so we expect and hope that you will share your perspectives broadly and then help us narrow in on recommendations.
What we will do is go through the formal statements and then we’ll convene open discussion. We are going to start with Dr. Shmatikov. We welcome you, Professor of Computer Science from Cornell.
DR. SHMATIKOV: Thanks for giving me the opportunity to talk a little bit. I’m going to talk about a few topics for a few minutes related to de-identification and why it started its life as a fairly reasonable protection technology but there are certain factors that have appeared lately that it does not seem to take into account, and why protections provided by de-identification seem to be very brittle in this day and age. I also will talk a little bit about what’s on the horizon and what new technologies are coming to the fore that promise better protections.
The main problem with de-identification, in my mind, is that it looks at data in isolation. De-identification tries to look at a dataset and then somehow do something with it that makes it safe for all possible users. No matter how people are going to try to extract intelligence from the dataset, the de-identification is supposed to provide an answer.
The reason this doesn’t work very well is that there is constantly a new stream of public sources of information about individuals — public databases, social media, all kinds of sources of genealogical information, all kinds of sources of information that make it very easy to take a de-identified dataset and link it with these sources of information to reconstruct identities and learn information about individuals.
What makes this problem so hard for de-identification is that it’s kind of difficult to say which attributes are actually identifying. In de-identification, people typically assume that you can just pick a few attributes like names and social security numbers, maybe demographics, do something with them, and the dataset becomes de-identified, but in reality, any combination of attributes can be identifying. And there are public sources of information that contain not just demographic information but information about people’s preferences, what they did, about their relatives, all aspects of their life, so, effectively, any combination of attributes can become an identifier.
That is a threat that existing de-identification techniques just don’t seem to take into account even though by now we have many examples of datasets that have been re-identified using these techniques. So that is one problem with de-identification.
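The point that any combination of attributes can act as an identifier can be illustrated with a minimal sketch. It counts how many records in a dataset are unique on a chosen set of attributes. The records, the attribute names (`zip3`, `birth_year`, `sex`, `favorite_genre`) and the helper functions are hypothetical and are not taken from the testimony.

```python
from collections import Counter

def group_sizes(records, attributes):
    """For each record, count how many records share its values on `attributes`."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def fraction_unique(records, attributes):
    """Share of records that are the only one with their attribute combination."""
    sizes = group_sizes(records, attributes)
    return sum(1 for s in sizes if s == 1) / len(records)

# Hypothetical toy records with names and other direct identifiers already removed.
records = [
    {"zip3": "148", "birth_year": 1980, "sex": "F", "favorite_genre": "jazz"},
    {"zip3": "148", "birth_year": 1980, "sex": "F", "favorite_genre": "rock"},
    {"zip3": "148", "birth_year": 1975, "sex": "M", "favorite_genre": "jazz"},
    {"zip3": "148", "birth_year": 1975, "sex": "M", "favorite_genre": "jazz"},
]

demographics_only = fraction_unique(records, ["zip3", "birth_year", "sex"])
with_preference = fraction_unique(
    records, ["zip3", "birth_year", "sex", "favorite_genre"]
)
print(demographics_only, with_preference)
```

In this toy data no record is unique on demographics alone, but adding a single non-demographic preference makes half of the records unique, which is the kind of effect the speaker describes.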
The other problem with de-identification is that it separates data from users of data. Somehow, it is applied to the dataset but it hardly ever talks about what is actually going to get done with the data. Some forms of de-identification destroy utility of the data and make it very difficult to use it, for example, in a biomedical context or do useful things like clinical studies. Others maybe preserve the utility of the data but do not actually guarantee any privacy because of the external linkages.
I feel that the conversation about de-identification of data and users of data — these two conversations cannot be separated; we can only talk about them holistically. We need to talk about the process of using the data and what it means to use the data in a privacy-preserving way, or in a safe way. We cannot talk about de-identification in isolation from the possible users of data.
This is especially important these days because there’s a lot of progress in machine learning. This is a technology that’s really taking over the world, and big Internet companies are now developing machine-learning platforms that are extremely powerful, much more powerful than anything we have had before. So we really need to understand what it means to apply machine learning to data in a privacy-preserving way.
The existing techniques for de-identification were developed before machine learning became a big thing. They don’t really support machine learning in any kind of reasonable way, and that’s a problem because that’s how intelligence will be extracted from the data in the years to come, using advanced machine learning. So, somehow we need this integrated approach that looks at how the data is going to be used and tries to understand what it means, for example, for machine learning to be applied to data in a privacy-preserving way.
Another component of this is technology has limits. Not all threats can be prevented by technology. There’s a very important role to be played by law and regulation in this picture. It is important to understand where the limits of technology are and what kinds of threats can be prevented, and some things can only be deterred using law and regulation. The focus needs to be on harm to individual users, bad things that could happen, and some of these, if they cannot be prevented technologically, that needs to be recognized and legal and regulatory defenses need to be developed for that.
The good news is that there has been a lot of work in the research communities — computer science, machine learning communities — to develop technologies that can help a little bit with data protection, maybe even more than a little bit. Definitely machine learning, which I already mentioned, is a big deal in understanding data. In that community, there is a lot of interest in how to do machine learning with privacy, how to design machine learning algorithms that don’t leak too much information about individuals’ data.
There is some work on secure computing, on using secure hardware, platforms like Intel’s SGX. This is very preliminary but at least it promises to have some kind of technology that allows some computations to be done in a sandboxed environment where only controlled information leaks from it. These are the kinds of things that computer scientists are working on a lot, so I hope to see very good progress in this area in the next few years.
MS. KLOSS: We will hear all the statements and then commence with discussion.
Our next presenter is Jacki Monson, Chief Privacy Officer, Sutter Health in Sacramento. Welcome, Jacki.
MS. MONSON: Thank you, and good morning. I was told this morning by Linda that I should share practicality, so that’s what I plan to do. There are really four categories that I want to talk about. The first one is transparency for patients. The second is the importance of data for medical research and innovation. The third is the current state of de-identification and some of the opportunities that I think we have. The last is the benefits and concerns for re-identification and some ideas I might have for that.
The first one and, I think, the most important one, is transparency for patients. We heard a lot about that yesterday, but there isn’t a standard approach to transparency and what that means, which means that it is at the discretion of the organization, the institute or anybody who is using that data. That is, I think, where patients are most concerned because they don’t know how their data is being used and disclosed.
Oftentimes, I spend a lot of time getting into conversations with our patients about what that looks like and how their data is being used. When I go into detail explaining it, they’re actually very comfortable with the use of their data, versus, when they don’t know, they often revert to the defensive mechanism which is my data is really sensitive and I don’t want it used. We know that data is extremely important to medical research and to helping the health of those patients so we absolutely need their data.
The other concern is extra protection needed for sensitive data. I’ll give you an example. California state law is pretty restrictive as it relates to sensitive information, and oftentimes, when it’s genetic or there are sensitive conditions related to patients, those are the ones that we need to conduct medical research on to have the ability to find a cure. In California, for example, HIV status — we are required to get written authorization before we can use it, even for health information exchange, let alone use it for medical research or other purposes.
It is really challenging, and it doesn’t mean that I’m saying we should not get authorization. I think there need to be better mechanisms to obtain an authorization that is not having to track down every single patient and getting a paper authorization. I think there should be better mechanisms. I know we have just started to see electronic consent, which I think is a better mechanism, but we need something I think even better than that.
Precision medicine. As we know, that’s kind of the future, and we need to be able to tailor it to a patient. In order to do that, we have to have their data and be able to be very transparent.
I like the idea I heard yesterday about a tiered approach. We have seen a lot of opting in and opting out as being very useful. We’ve done that for HIEs and it has actually worked really well. I think that a mechanism like that, which is a little less than an authorization but still informing the patients about what their data is being used for, even in de-identified form, would help. When there is any data left, they care, and they want to know how it’s being used. I think that, in lieu of an authorization, looking at opting in and opting out might be more beneficial to our patients and provide them that transparency that they’re looking for.
The second topic I’m going to move into is the importance of data for medical research. When I was preparing to come here I had lots of conversations with our research organization and our physicians around this. We need data to provide the best quality care and also to address patient safety, which means we need to be able to share it not only within our organization but across other organizations, so the concept of big data.
However, we have a lot of challenges with that. With interoperability, everything is available electronically. But, if we need to make it a limited dataset or we need to de-identify it, we can’t do that electronically, which means that we’re exporting that data, which is often very clunky, to other sources to be able to come up with a limited dataset or de-identified, and it is not practical. As we push the electronic environment and interoperability, that is not really supporting that viewpoint and it’s not helping us get there to really achieve what we need to achieve for big data and focus on research.
Particularly with precision medicine, if that’s truly the next generation — which I think it is — we need to be able to link accurate data electronically, and we don’t really have the ability to do that to the desire and the need and meet the regulatory requirements. And it’s particularly challenging, if we do de-identify it, to be able to try to re-identify it if needed and appropriate in that situation. It’s basically impossible to do that today.
The next topic is the current state of de-identification. I think healthcare has some of the best guidance, yet, I don’t feel it’s adequate. For example, the expert method — it isn’t accessible without a significant investment of the organization. Oftentimes, when we have research, we have limited funding for that particular research initiative, and with that limited funding, when we’re talking about statistically de-identified and approaching an expert, it’s very expensive. So, what happens is oftentimes that research is not done, not moved forward, because the cost-benefit analysis of how much it’s going to cost us for that expert is going to take away that entire research budget which has limited funds to help our patients.
It doesn’t address population health. It doesn’t address system testing as we work towards interoperability, operational uses, external research. The 18 identifiers are subjective, and analysis is always needed. It has been interesting at our organization: my viewpoint of the 18 identifiers is very different than, for example, an IS professional who might view metadata and other types of data as identifiable or could be identifiable but, from my perspective, isn’t even something that we look at. So it needs to be more flexible but yet provide enough guidance that we know what we should and should not be doing.
One of the things that I think would be really helpful is to actually have use cases that provide perspective, similar to Office for Civil Rights. They have Q&As as it relates to the privacy and security regulations, and it might be very useful as it relates to de-identification to have Q&As on case studies so that we can look at it, whether it’s population health, precision medicine or other types, and say oh, yes, we’re doing this exactly, and this is what their guidance is around this. We do have that flexibility. Versus, a lot of times what we revert to when we don’t know because it’s a gray area is we’ll go to that statistical de-identification and look for that expert, which again is really expensive, which means that we revert from it often and don’t proceed with that medical research. As I mentioned before, it doesn’t contemplate innovation and the challenges with the electronic data, which means that we have to pull it.
We have seen recently some new technology — private analytics is I think one of them — that claims to do statistically de-identified data. The question I have as we review those types of software is, is that really going to meet our needs, or is that going to meet the needs of the regulators or meet their requirements? Maybe it’s not, maybe it is. I think what’s interesting about it is that could be our mechanism for doing it electronically more efficiently if the technology existed. That could provide that opportunity.
I’ll just mention one other thing related to vendors. As a privacy officer, it absolutely keeps me up at night dealing with vendors and the opportunity for re-identification. As you heard alluded to, the social media and the ways we can re-identify data that’s publicly available, that is definitely a concern of mine.
What we have done is put in every single business associate agreement that we execute, beyond the HIPAA regulations, that regardless of whether it’s a limited dataset or de-identified, it is still our data and we still want control over it. And only if it’s a benefit to our patients or it’s for quality purposes of the organization do we authorize in writing that they are allowed to use it for those purposes and only those purposes, and if it’s for something else, they have to get our authorization or approval to be able to do that.
The practicality of that I don’t know. We have 5,000-plus vendors, so am I conducting audits to make sure that they’re actually not re-identifying data or using it for those purposes? No, I am not. So that’s definitely a concern that I have, just looking at the practicality.
And then, what about the subcontractors? We know that many of our vendors use subcontractors, possibly overseas subcontractors, so making sure that they can be bound to that information, and there is no regulatory guidance to provide. Oftentimes, when we’re in discussion with vendors they like to point out that we are more stringent than the regulatory requirements and are concerned with our need or want to be able to do. And we feel as an organization that we don’t have a choice because there isn’t any other guidance to provide us that flexibility or that enforcement when we’re concerned about re-identification.
Now I’m going to move into the benefits and concerns of re-identification. I have mentioned a few of them already. I think it’s a combination of controls that are needed, both the regulatory guidance or policy from the regulators, as well as policy. I think that is absolutely needed to make sure that in situations where it’s a benefit — for example, if it’s a benefit to a patient to allow it to be re-identified because we’re looking at some genetic conditions and we want to focus on finding a cure for that genetic condition, we should absolutely re-identify it to be able to look at that data and value it.
In situations where it is not beneficial to the patient, we should not allow it to be re-identified. And there’s really no guidance that says we cannot have that flexibility, but I think you have to look at it from a —
(Lost audio signal for one minute)
MS. KLOSS: Thank you. Jeptha.
DR. CURTIS: Thank you. The purpose of this discussion is really to level set the group as to the perspective that I am bringing to this organization. I think it’s actually nice that we have three very different perspectives coming to bear, and I think the conversation to follow will be where the actual meat of the discussion takes place.
What is the American College of Cardiology? It is a professional society that participates in a number of different activities aimed at lowering costs, improving healthcare and health outcomes. The NCDR is an organization that works within the American College of Cardiology and runs a number of registries that collect personal health information on a fairly large scale. It has grown significantly over a 12-year period of time from a single registry with eight participating hospitals to now 10 registries that are up and running that really run the gamut of all cardiovascular disease.
With that growth has come an expansion of the tools that we provide and the goals of the registry. We continue to provide or seek patient-centric care, allowing value-based reimbursement reporting tools. We’re increasingly global in our perspective with international participation in our registries. We provide a platform for both clinical trials as well as comparative effectiveness research, and conduct post-approval studies from a number of different stakeholders.
Really at the heart what we are is a quality improvement organization that provides tools and mechanisms by which individual health systems and hospitals can deliver better care to their patients.
Kind of the informal motto of the NCDR is the following: Science tells us what we can do, guidelines what we should do, registries what we are actually doing. From that perspective, I think it really shows the importance of registries in delivering high-quality care on a reliable basis.
How do we acquire data? There are the inputs and the outputs. The inputs come from a variety of different sources including certified software vendors, homegrown systems. The ACC has increasingly invested in developing Web tools that allow hospitals to directly export their data and, more recently, working with the EHR vendors to directly port data in.
The data is cleaned and then the outputs are really focused on the twin pillars of benchmark reporting, and this is used to inform quality improvement initiatives, but there is also, relevant to this discussion, a large component of clinical research.
How do we work? Again, we have comprehensive outcome reports that usually come out quarterly that benchmark individual hospitals and professionals to their peers and provide other measures and metrics including appropriate use criteria and risk-adjusted outcomes including mortality and readmission. We have registry dashboards that, on a real time basis, allow participants to drill down on their data and better refine their comparison groups so that they can have a better sense of where they’re performing compared to their peers.
As a byproduct of that, the NCDR now has more than 30 million HIPAA-compliant patient records. The question facing this organization, the ACC, on a daily basis is what can and should be done with this information.
From a research perspective the NCDR has a number of capacities and capabilities. Certainly the largest strength is in retrospective analyses. We leverage existing NCDR baseline patient data for prospective registries and analyses as well as retrospective analyses. In addition, there are increasing linkages with third party claims data, primarily CMS data, and for different groups we do study cohort modeling so that they can better design clinical trials a priori.
The research portfolio is robust. We have very important knowledge that has been generated as a byproduct of this registry, and I think this has impacted patient care in a very positive fashion. The concern is that we’re just scratching the surface. Most of our registries are restricted to a certain episode of care or in-hospital care, and it’s really not leveraging the potential use of this data to better inform knowledge and care.
What are the other research studies and the prospective registries? Some of them may be familiar to you. The ASCERT SYNTAX study linked registry data from our cath and angioplasty database with corresponding data from the Society of Thoracic Surgery to really better define for patients what is the best therapy for an individual — bypass surgery or multi-vessel angioplasty. That’s one of many examples of the ways that we’re improving care through the generation of new knowledge.
How do we justify using these patient records for research? Certainly, the utilization is guided by business associate agreements between participating centers as well as approved data analytic centers. These agreements allow the American College of Cardiology to use protected health information submitted through registries by means of limited datasets as well as, in a somewhat restricted fashion, de-identified data. This de-identified data is wholly owned by the ACCF and can be shared with external stakeholders without approval by sites or patients.
With regards to our approaches to de-identification of data, we have really only used the safe harbor method. As a rule-based approach it’s very clean and you can clearly define whether or not you are in compliance with the regulations. We have attempted to use expert opinion models in the past. As I think you heard from Jacki, there are theoretical benefits to the expert opinion model in terms of retaining certain direct identifiers in certain situations, but there is a lot of ambiguity in the execution of that and there’s a lot of associated cost and potential risk on the back end if your expert was wrong.
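The rule-based character of the safe harbor method can be illustrated with a minimal sketch covering three of its generalization rules: ZIP codes reduced to their first three digits (replaced by 000 for sparsely populated areas), dates reduced to the year, and ages over 89 grouped into a single 90-or-older category. The record layout, the function name and the `LOW_POPULATION_ZIP3` set are assumptions made for illustration; the sketch does not cover the other identifier categories and is not a compliance tool, and it is not a description of how the NCDR processes its data.

```python
from datetime import date

# Placeholder values for illustration only; not the official list of
# three-digit ZIP prefixes covering 20,000 or fewer people.
LOW_POPULATION_ZIP3 = {"036", "059"}

def generalize_record(record):
    """Apply three safe-harbor-style generalizations to one hypothetical record."""
    zip3 = record["zip"][:3]
    if zip3 in LOW_POPULATION_ZIP3:
        zip3 = "000"  # sparsely populated prefixes are collapsed to 000
    age = record["age"]
    return {
        "zip3": zip3,
        "service_year": record["service_date"].year,  # keep the year only
        "age": "90+" if age >= 90 else age,  # ages over 89 form one category
        "diagnosis": record["diagnosis"],
    }

# Hypothetical input records.
older_patient = {"zip": "03601", "age": 93,
                 "service_date": date(2015, 3, 14), "diagnosis": "I21"}
younger_patient = {"zip": "06510", "age": 67,
                   "service_date": date(2014, 11, 2), "diagnosis": "I25"}

print(generalize_record(older_patient))
print(generalize_record(younger_patient))
```

Each rule is a fixed transformation that can be checked mechanically, which is why compliance is easy to define, and it also shows what is lost: exact dates and fine geography are no longer available for following patients over time.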
With regards to re-identification, we do not, nor do we seek to, have a mechanism by which individual patients could be identified in a de-identified dataset. That would be somewhat paradoxical. Nevertheless, as you have heard, there are certainly ways that individual patients could be identified in ways that were not really anticipated with the initial regulations.
The ACC and the NCDR have a multifaceted approach to reducing the risk of disclosure of PHI predominantly through data security policies and practices, data governance and oversight, internal audits and disclosure if and when disclosure — whether or not PHI was compromised.
What are the limitations of de-identification? Certainly, the standards are workable, particularly in the safe harbor model. However, the resulting data does not allow us to leverage the data to advance the science and medicine. We need better ability to follow patients over time and to better link to external databases that will enrich the clinical data that we already obtain. And we really need richer multidimensional data to serve society’s need for better evidence that can inform clinical practice.
Most of the work we do for registry research is not using de-identified information but, rather, limited datasets. These retain information such as age, dates of service and zip code, and use of the data in this regard is consistent with our data use agreements with sites. Analyses performed are performed within the ACC or at approved data analytic centers, as I noted, and the data are externally presented in the form of summary results.
Where would we like to be? I think this schematic identifies how we think the next generation of NCDR research should take place. We have both longitudinal as well as episode-of-care registries, and it would be great and I think very informative if we could conduct meaningful research by leveraging different databases so that we know, when a patient comes into our MI database, the ACTION Registry, and then we can follow them in the outpatient PINNACLE Registry and then re-identify them in the CathPCI Registry when they come in for an elective procedure and so on and so forth, such that we can really trace an individual and better understand at the aggregated level what is the best therapy for that individual and patients like them.
So what is needed? I think improved interoperability of data, both within the ACC, where our registries need to talk to each other more effectively, but also with increasing consistency across entities that work with PHI. We need to have an enhanced ability to share data. Distributed analysis, wherein centers analyze data within their own datasets and then share the aggregated results, is an improvement, but it’s unlikely ultimately, in our opinion, to deliver on the promise of big data.
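The arrangement in which centers analyze data within their own datasets and then share only aggregated results can be sketched as follows. This is a minimal illustration that assumes a simple pooled event rate; the site data, function names and outcome coding are hypothetical and do not describe an actual NCDR workflow.

```python
def local_summary(outcomes):
    """Run at each site: reduce patient-level outcomes (0 or 1) to counts."""
    return {"n": len(outcomes), "events": sum(outcomes)}

def pooled_rate(summaries):
    """Run at the coordinating center: combine site-level counts only."""
    total_n = sum(s["n"] for s in summaries)
    total_events = sum(s["events"] for s in summaries)
    return total_events / total_n

# Hypothetical patient-level outcomes that never leave each site.
site_a = [0, 1, 0, 0]
site_b = [1, 1, 0, 0, 0, 0]

summaries = [local_summary(site_a), local_summary(site_b)]
rate = pooled_rate(summaries)
print(summaries, rate)
```

Only the counts cross institutional boundaries. The limitation is also visible in the sketch: because no patient-level rows are pooled, the coordinating center cannot link a patient seen at one site to the same patient at another.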
And we do need clearer standards for handling PHI, governance, policies and processes on data security and guidance in putting these standards into practice. Thank you very much.
MS. KLOSS: Thank you so much. Next we will hear from Dr. Capps from the U.S. Department of the Census.
DR. CAPPS: Hello. I am Cavan Capps from the Census Bureau. My background is a lot of computer science, statistics at BLS, Census, Mathematica, Wharton Econometrics, et cetera. I was tasked about four years ago by Dr. Gross to start a lead on looking at big data for the Census Bureau. His opening was, frankly, that today, with all the transaction databases out there and the fact that at least from the Census perspective everything we used to do in paper we are now doing electronically — everything from buying things to sending mail to doing business transactions — he thought that we should start focusing on electronic data.
As a result, Census is working with big data. In fact, I worked with an NCHS project to collect emergency room electronic data, so electronic data becomes something that is affecting the entire commercial world as well as the healthcare environment.
As you probably know, Census collects more data than just the Decennial. We do about 300 or 400 different surveys, so we deal with a lot of different types of information. But in the health arena, what’s very interesting is recently I actually had a health situation where I asked the doctors what were the longitudinal outcomes of the different procedures that they wanted to give? They didn’t have a clue. What they could tell me, literally, was how easy it was to cut me and how often they had problems during the surgery, but they had no idea of the longitudinal outcomes.
I said, I can’t believe that. I work at Census. You have all this data downstairs in your billing organizations and you have no idea what the outcomes are, and you’re in a network of 10 hospitals and you have no idea how to link the data together? So they invited me to talk to their researchers later, and the researchers said, yes, we have a horrible time getting data. Part of it is because of HIPAA and part of it is that when they got data, they got data from a hospital outside of their network, and they got a dataset with 25 observations, and I thought that was ridiculous.
So, today we are faced with a dilemma, and this is where Census is at: we need more and more data to be relevant. And in the health arena, when you’re talking about outcomes, talking about precision medicine, talking about genealogical issues, you’re talking about more and more granularity than you have ever had before, so 25 observations really doesn’t give us the power we need to do our analysis. We need to be able to link this data together and we need to be able to move ahead in a jump in terms of medicine so that we see what the side effects are and the quality outcomes.
One of the things that was just discussed is we were working with NCHS to potentially link emergency room data with mortality data for the first time to find out, if you didn’t come back to the emergency room, whether you died, because nobody had any idea before this. So, linking data is becoming more and more a critical part of determining quality.
We have historic opportunities to basically improve the quality and reduce the cost of medicine, but to do that we’re going to have to connect the dots. We are going to have to be able to think about seeing the relationships between people’s genetics, their behavior and between the different treatments that they have had, which makes it that much more difficult to deal with this.
Privacy and big data are the collision. Because of these secondary datasets that Vitaly talked about, it’s easier to match things to secondary attributes. It’s not just the identifiers anymore. And, of course, Census is concerned about doing research on it because we’re concerned about protecting our datasets and being able to release more.
We have come up with a proposal — only a proposal — about a common privacy data architecture. It basically assumes something that Vitaly or somebody else might challenge which would make it interesting, which is that the way you break privacy is by taking another dataset and matching it to another that has identifiers on it. So, if my dataset has no identifiers on it, you can’t identify anybody as long as you can never get outside of that dataset. But if I can take another dataset that has secondary attributes on it and it has names or addresses, eventually I’ll get back and re-identify something.
Part of this idea is to create essentially a way of computing where you don’t keep the identifiers with the data. This is the way we do it at Census and this is the way I think several private companies do it. The advantage of not keeping the identifiers with the data and keeping it encrypted separately and connecting with a pseudo-ID is the fact that you also protect against identity theft. If somebody hacks our database they may get a Visa card number but they’re never going to find out who it belongs to. This is just good practice.
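[Illustration: the separation described here can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the Census Bureau's actual system. The function name `split_record`, the field names, and the sample record are hypothetical, and the encryption of the identifier store is left out; in practice that store would be encrypted and held apart from the data.]

```python
import secrets

def split_record(record, identifier_fields):
    """Separate direct identifiers from the analytic data.

    Returns a random pseudo-ID plus two dictionaries: one holding the
    identifiers, one holding everything else. The pseudo-ID is random, so it
    carries no information about the person by itself.
    """
    pseudo_id = secrets.token_hex(16)
    identifiers = {k: v for k, v in record.items() if k in identifier_fields}
    data = {k: v for k, v in record.items() if k not in identifier_fields}
    return pseudo_id, identifiers, data

# Two separate stores (hypothetical). Only the pseudo-ID connects them.
identifier_store = {}  # assumed to be encrypted and stored elsewhere in practice
analytic_store = {}

record = {"name": "Jane Doe", "address": "1 Main St",
          "card_number": "4111-0000", "amount": 42.5}
pid, ids, data = split_record(record, {"name", "address"})
identifier_store[pid] = ids
analytic_store[pid] = data
```

Someone who obtains only `analytic_store` sees a card number and an amount under a random key, with no name or address attached, which is the point made in the testimony.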
The second thing is the idea of provenance — where the data came from, what are the commitments you made to patients, for example. And, finally, making sure that all the data that’s operated on in the system, if you do an analysis on the system, is logged, and that no one can edit that log including DHS. If you edit the log, you break the log. That’s cryptographically possible.
The result is, all of a sudden, for the first time, we’ll have something that will allow us to do privacy auditing much like we currently have financial auditors coming in. You’ll be able to say to your vendors and sub-vendors did you use this data appropriately. If not, let me look at the log. You can hire somebody to go through those logs and audit them. This means that you could use this across a whole set of different situations.
When we talk about provenance, there is no consensus on what we want for provenance, but basically, when you look at the HL7 provenance information, it’s about an individual patient record. What would be nice is if we had a provenance much like Jacki was talking about that said this is what the patient agreed to have their patient records used for. We are going to allow it to be used for research now.
What does research mean? We don’t have a controlled vocabulary to talk about it, but it would be nice to be able to say that when I go into the doctor’s office I agree to have my records used for research, and I’m assuming that means they’re not going to have any of my PII on it. They will only use it in groups; they’ll only use it for estimation analysis. It would be nice if it was machine-readable and human-readable, much like the HL7.
It would also be very nice, particularly when you’re talking about business agreements, et cetera, if we had a chain of custody like any other provenance. Anybody who touches that data, it would be nice to have their fingerprints on it and know what they did to the data. If they did transformations on the data we would like to know what the transformations were. We’d like to know who did it, and we would like to be able to make sure that you couldn’t change that.
The other point would be to digitally fingerprint this provenance data so that I don’t go in and edit it later, and so the subcontractor doesn’t edit it and say I’m not going to tell you about it. I can compare my provenance to the data. The idea is to have actually a digital fingerprint to the data.
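[Illustration: one common way to fingerprint a dataset together with its provenance is a cryptographic hash. The sketch below uses SHA-256 from Python's standard library; the speaker did not name a specific algorithm, so that choice, the function name `fingerprint`, and the sample records are assumptions.]

```python
import hashlib
import json

def fingerprint(data_rows, provenance):
    """Return a SHA-256 digest covering both the data and its provenance."""
    # sort_keys gives a stable serialization, so equal content hashes equally
    payload = json.dumps({"data": data_rows, "provenance": provenance},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical extract and provenance record
extract = [{"pseudo_id": "a1", "visits": 3}, {"pseudo_id": "b2", "visits": 1}]
provenance = {
    "source": "hospital_A_billing",
    "consented_use": "research",
    "transformations": ["dropped name and address"],
}

original = fingerprint(extract, provenance)

# A subcontractor quietly removes the record of what was done to the data
edited = dict(provenance, transformations=[])
tampered = fingerprint(extract, edited)
```

Because `tampered` differs from `original`, anyone holding the original fingerprint can tell that the provenance no longer matches what was issued with the data.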
If we have the PHI it would be nice to keep that separately and encrypt it. We can link it with a pseudo-ID. The issue is how you want to do that. I don’t know if you want to have a universal pseudo-ID between institutions because all of a sudden it’s not a pseudo-ID anymore. But down the road it would be very nice to be able to link institutional data together.
Why can’t we do distributed data processing? Why do we have to have one hospital? Why do we have our administrative records in this database? We spend millions of dollars to dump out an extract, a repository, to this other database. When I come in to a doctor, why can’t my data already be available for doing analysis on it with the PHI stripped? And why don’t I have transactions to say if I want to talk to my doctor and re-identify it? Why don’t I have an automatic method to re-associate my PII with it and to log that in a way that people know that we logged it for a particular reason? So it’s a glass box. It’s not a black box; it’s a glass box with transactions that can be audited.
The nice thing about this, too, is if you do extractions it would be very nice if — Currently, we’re doing extractions and you lose all the provenance. When we’re getting records at Census Bureau, buying it from data aggregators, we don’t know where the data came from. It’s all private sector, secret sauce. Oftentimes, we find the same bad records in multiple aggregated datasets. What they’re doing is buying each other’s data and mashing it all together.
It would be very nice if we had a situation where you had a system set up and probably regulations that support it that say when I do an extract, I get the provenance with that for the dataset. Maybe even link it back to the founding dataset. And now, whenever that goes around we have a fingerprint with it, and that’s part of the data.
Finally, what’s really important to make this happen is we currently have logging all over the place, but if I’m going to hack a system — you know, part of the IT security thing is learning how to hack a system — the first thing I’m going to do is go into your logs and edit the logs. If somebody, for enforcement purposes, comes in and looks at your data, the first thing they’re going to do is change the logs so that you’ll never know they are there.
As they say in the computer world right now, there are two types of systems. One is the group of systems that know they have been hacked, and the second group of systems that don’t know they have been hacked. So, IT security is increasingly focusing on not just stopping the hack but trying to figure out if you’ve been hacked already.
It would be very nice to have logs that cryptographically could not be edited but could be added to. If we did that, they would automatically break and you would know. If somebody wanted to go in and look at your privacy data for law enforcement purposes, they could do it. You can’t force that they go through procedures but you can assume they did. You can have a log that said these guys came in for this reason and it’s unalterable. Again, the combination allows you to create an independent privacy audit.
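[Illustration: a log that can be added to but not silently edited is often built as a hash chain, where each entry stores a hash of the entry before it. The sketch below is one such construction, not a description of any Census system; the names `append`, `verify`, and `GENESIS` and the sample events are hypothetical. It also assumes the most recent hash is kept somewhere an intruder cannot rewrite, since otherwise the whole chain could be recomputed.]

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(prev_hash, event):
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(log, event):
    """Add an entry that is chained to the current end of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})

def verify(log):
    """Return True only if no entry has been altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True

log = []
append(log, {"who": "analyst_7", "action": "ran regression", "dataset": "er_visits"})
append(log, {"who": "agency_x", "action": "re-identified record", "reason": "court order"})
intact = verify(log)

# An intruder rewrites the second entry to hide the access
log[1]["event"]["who"] = "nobody"
after_edit = verify(log)
```

Here `intact` is `True` and `after_edit` is `False`: adding entries keeps the chain valid, while editing one breaks it, which is the property an independent privacy auditor would rely on.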
There are three potential models that we’re looking at for analyzing data. One is an enclave, a sandbox, that might allow people to build interfaces to it for a model server. If this terminology is obscure, please ask me about it.
Essentially, what this would allow us to do is allow people to come into a locked room and they would use VDI — a secure thing — and they could go into the confidential data. The PII would be stripped off. I think it would be interesting to do analysis statistically to see can we do what we need to do without ever looking at individual records. Can we do the research we need to do?
Now, this can’t be done for precision medicine, but for statistical purposes you could look at distributions of data. If I’m looking for correlations, I can look for correlations without actually identifying anybody. Even so, you would be within a room; you would not be allowed to bring data into that room, so you could not take a secondary dataset, match it to the attributes and then figure out somebody.
Essentially what we’re doing now is saying we’re going to trust people in some way but we can make it more secure so the level of trust and the cost to get that trust should come down, so we can actually do more work. Any of the outputs of this research should go through a formal privacy filter. We have been exploring differential privacy, and if we can find algorithms and techniques to do that, that would make a lot of the research cheaper to do and more effective in terms of its output.
A second methodology might be to use formal privacy techniques and create a synthetic dataset potentially using differential privacy. Differential privacy has issues. We should think about those issues. You have a privacy budget. You add noise to data, and certain data is going to be released with more accuracy than other data. You have to start making a decision about that data about where the value in the dataset is. What are the most valuable elements of the data?
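[Illustration: the "privacy budget" and "add noise" ideas can be sketched with the Laplace mechanism, a standard differential-privacy technique for counts. This is a toy example under stated assumptions: the class name `PrivateCounter`, the epsilon values, and the counts are hypothetical, and each count is assumed to change by at most 1 when one person is added or removed.]

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

class PrivateCounter:
    """Releases noisy counts while tracking a total privacy budget (epsilon)."""

    def __init__(self, total_epsilon, seed=None):
        self.remaining = total_epsilon
        self.rng = random.Random(seed)

    def noisy_count(self, true_count, epsilon):
        """Spend `epsilon` of the budget to release one count.

        A larger epsilon means less noise (more accuracy) but uses up more
        of the budget.
        """
        if epsilon <= 0 or epsilon > self.remaining:
            raise ValueError("invalid epsilon or privacy budget exhausted")
        self.remaining -= epsilon
        # Sensitivity of a count is 1, so the noise scale is 1 / epsilon
        return true_count + laplace_noise(1 / epsilon, self.rng)

counter = PrivateCounter(total_epsilon=1.0, seed=0)
first = counter.noisy_count(1200, epsilon=0.5)
second = counter.noisy_count(37, epsilon=0.5)
# The budget is now spent; a third release would raise ValueError.
```

The trade-off the testimony describes shows up directly: whoever decides how to divide the budget decides which statistics come out accurate, and once the budget is gone no further analyses can be released.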
You also have to start thinking for the first time about harm. In the statistical community I am unaware of real analysis. We have had the luxury of not really dealing with what’s harmful and what is valuable in terms of data. We put the data out there, and we think the more people who use the data and draw conclusions from the data the better. If we’re going to move to a new model we may say we’re going to have to limit these people from doing analysis because these people got there first. That’s something we have to think about.
We haven’t thought about synthetic data. We have done experiments with it, but there hasn’t been a standardized formal vetting of the data. I think it’s great to have datasets released to the public to use, particularly for the universities so that young students can learn how to do analysis and oftentimes they find things that we didn’t — I would actually like to have microdata. Some guys at University of Pennsylvania just took some data from Census, health data, and found interesting things about depression among certain demographic groups that the Census bureaucrats didn’t think to look for.
We’re not going to find everything. We need other people to do analysis on our data, so it would be nice to do that. But to do that, if we’re going to release these datasets, we have got to be thoughtful and systematic about evaluating these datasets’ release and, so far, I don’t know that we have done that.
The third model, which I find particularly interesting and which it would be great to think about in a health environment, is secure multi-party computing. We are exploring this at Census. Our problem is that we do an economic census; we do it once every five years, and we don’t have any, for example, supply chain data in the United States. During the last recession we had no idea, if General Motors went down, how many businesses in which other states would be affected, who would be laid off. We could have gone into a bar and argued about it and that would be about as good as our analysis would have been because we had data that was five to seven years old.
In an electronic world, why are we asking people to fill out surveys and spending hours to recall surveys when we should be able to get their email and cc it to a statistical agency and do the bean counts on the fly. Why can’t we get near real-time data?
Well, one reason is because the companies don’t trust us. We have a privacy concern with companies. If we ask Walmart, give us your transactions for the last five years and put a clerk down in your basement and spend three months filling out forms, that’s one thing. But if we say we want your real-time customers, your real-time suppliers, and we want to know exactly what your cost is, and you have, let’s say, state actors in other countries wanting to hack that system, when you consider the SWIFT banking and wire transactions that have been hacked (J.P. Morgan has been hacked), a lot of these businesses are saying we don’t trust you with our data.
So, our exploration is moving to saying what if we put a computer at Walmart and read your database and we kept it all encrypted so if they hacked it, they couldn’t get it? Wouldn’t it be neat if we could take Walmart and Target, or let’s say the car companies, take Honda, Hyundai, General Motors and Ford, and say any of the corporate sponsors could run aggregate statistics on this on a weekly basis, so you, Mr. Car Company, know how well your business is doing and also how well the overall car business is doing? So now you are able to see how well you are doing, but you can’t back out the data and figure out your competitor.
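[Illustration: one textbook building block for secure multi-party computation is additive secret sharing, in which each party splits its private number into random-looking shares. The speaker did not specify a protocol, so this choice, the names `make_shares` and `secure_sum`, and the weekly sales figures are assumptions. The sketch runs in one process for clarity; in a real deployment each party would run on its own machine and exchange shares over a network.]

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value

def make_shares(value, n_parties):
    """Split `value` into n random shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_values):
    """Compute the total without any party revealing its own value."""
    n = len(private_values)
    # Party i creates one share for every party (including itself)
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that sum
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # The published partial sums add up to the true total
    return sum(partial_sums) % MODULUS

# Hypothetical weekly unit sales for four car companies
weekly_sales = [120, 95, 210, 180]
industry_total = secure_sum(weekly_sales)
```

Each individual share is a uniformly random number, so a company that sees a competitor's share learns nothing from it, yet `industry_total` comes out to 605, the same as adding the raw figures. Every participant can then compare its own number against the industry aggregate.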
We are interested in exploring this. Think about putting that in the health environment. Wouldn’t it be cool to say, now, hooked up to each of your EHR systems, you basically have something that’s pulling from your database and putting it into an analytical database that’s encrypted and allows people to do distributed computations for certain classes of operations, certain types of regressions, for example, certain types of tabulations across a set of machines. Now, that doesn’t deal with precision medicine, but what it does open is the potential for saying we’re doing distributed processing. We don’t have to pull stuff into one machine and spend lots of money to do that and clean it. We’re actually moving stuff that we’re trying to clean on a constant basis.
It would also be interesting because if you start linking the data together and start finding inconsistencies in the data, you can actually have cleaner data on a regular basis. This is a big, outside-of-the-box thought, not something you could do immediately, but the technology is increasingly tipping into the arena where that’s interesting to look at.
I’m ready for questions.
MS. KLOSS: We are going to adjust our agenda a little bit and ask Dr. Erlich to provide his testimony now. He has a scheduling issue, so he will join our Panel III and we’ll adjust our break accordingly, perhaps push it back a little bit so we have plenty of time for discussion.
DR. ERLICH: Thank you very much. I am very excited to be here. I’m a Professor of Computer Science at Columbia University and also a member at the New York Genome Center. Before I became a geneticist I worked as a vulnerability researcher in a computer security company in Israel hacking into banks and credit card services, and then I switched to genetics. One of the main foci of my group is genetic privacy.
Kind of the focal point of my testimony is that it is hard to de-identify genomic datasets. We are now entering into the era of ubiquitous genomic information. We have initiatives such as the Precision Medicine Initiative by the President of sequencing one million genomes. We have similar initiatives worldwide including in the U.K. sequencing 100,000 genomes. The point of these initiatives is that we need to share genomic information.
The signals that we are searching for in genetics are usually very weak signals with a very low signal-to-noise ratio; therefore, we need to pool information from a large number of genomes, tens of thousands of genomes usually, to start to make statistical inferences. Therefore, there is a strong push by the community to share these datasets. And we also have websites, such as the 1000 Genomes Project by the NIH, where you can just download thousands of genomes without explicit identifiers but including all the genetic information of these individuals. So this is what happens in the clinical scientific world.
In addition, in the more citizen science, curiosity world, we have companies such as 23andMe, Ancestry DNA, Family Tree DNA that offer services to genotype a large number of markers on genomes of individuals. The service is really great. They send you either a swab or something to spit inside, you mail it to them, pay about $100 or $200, and they will genotype your entire genome after a few weeks and they will offer some sort of usually genealogical and ancestry analysis on your genome, plus you can also download your genome from their website as a text file. These companies have already collected between two to three million people, and most of them are from the U.S.
On top of what these companies offer, the genetic genealogy community, which is quite vibrant, also built their own tools and their own websites to crowd-source information from people that were tested with these companies and created large databases. For example, here is YSearch.org. This is a database where males can submit their Y chromosome data together with their surname, and there are around 170,000 records in this database documenting Y chromosomes and surnames.
There is also a search mechanism that you can search the database without a subscription. This is on a government computer; I didn’t put any key. I could just search this database. This is for Y chromosome data.
This is what we have for crowd-sourcing genetic information. We also have websites to crowd-source genealogy information. We have websites such as Geni.com where people can upload their entire family tree and they can also collaborate. They can basically take their family tree and, let’s say that Barbara uploaded her tree and I uploaded my tree, if we have a shared relative the website would offer us to merge the two trees together so we can create a larger tree.
My group downloaded the entire information from this website. We have 18 million records. We are able to build a single pedigree of 13 million individuals, which includes the President and also Kevin Bacon — they are related based on genealogy, apparently. So there is a vast amount of information out there online. We use these data not for re-identification but actually just to do some studies of longevity.
All these datasets, there are people, citizen scientists who are interested to basically re-identify themselves who don’t know if they are adoptees or people that were conceived by sperm donation. They go to these resources and they try to find their biological family.
I know it was presented yesterday that maybe re-identification is really hard. I invite you to come to the Facebook group called DNA Detectives. It has 16,000 people, and every day there is a success story of how you can use this website to identify individuals. And this is done by citizen scientists, by people without a formal background in biology or computer science.
Talking about all these very exciting arenas, my group four years ago, decided to test to see what is the status of genetic privacy from the 1,000 Genomes Project and all these great resources available for geneticists.
We focused on inferring surnames of individuals from their genome. How does it work? Consider that you have a couple, let’s say the Smith couple. If this couple has a child, a male, the father will give the child his Y chromosome and also his surname. When this child is married, he will also give his kid, in most cases, his Y chromosome and surname. This starts to make a correlation between surnames and the Y chromosome.
We found that, based on empirical evidence from 900 Y chromosomes for which we knew the surname and which reflect the U.S. population, there is about a 15 percent chance of recovering the surname of someone in the U.S. This is based on data from 2012, not the current status of the database. There are about 1,000 new records every month entered into this database.
After we have the surname, you might say, okay, you have a surname. There are billions of people named Smith, Johnson, Jackson, right? But we get not the most common surnames. We get in most cases surnames that are relatively rare, that are found in one in 5,000 individuals. That means that we can take our search space and go to about roughly 50,000 males once we have the surname. Then, we can take identifiers that are not protected by HIPAA such as the age and the state. If you have age, state and surname based on the Census data, we can get in most cases to about 12 males or less.
When we have 12 males, then I can now just call each one of them using social engineering, maybe someone without an accent like me can call each one of these people and ask them, did you participate in a genetic study, and get to this person. This is just based on simulations. Then we wanted to do a real test.
We took the genome of Craig Venter. We looked at the Y chromosome. You see the numbers; these are the markers from the genome of Craig Venter on YSearch. If I click Search on this Web page, after a few seconds you will see that I get Venter as a top match. I could go from the genome of Craig Venter to the surname Venter. If you want to replicate what I do over here, there is a link on this yellow note over there, and you can do it on your own computer. Find the surname of Craig Venter from his Y chromosome.
After we had the surname for Craig Venter we knew that this person lives in California and was born in 1946. We went to USsearch.com, pulled his three identifiers, clicked Search, and found the record J. Craig Venter.
We did the same process for the genomes in the 1,000 Genomes Project and were able to identify close to 50 individuals in this case, linking their genomic information with the full identity and, in some cases, also Facebook photos and so on.
So what are the conclusions? I told you about the practical attack to identify DNA information. There is something crucial in this attack. We don’t need the same person to be in this database. We can have, if you’re a male, your third cousin in this database, put in his Y chromosome with surname, and this person will identify you, because information amplifies through these genealogical links. So, with a database of about 170,000 people we identify millions of people in the U.S. This problem, of course, is going to grow because people add more identifiers to the data.
Now, going to HIPAA. I have to say that I am a bit confused about the status of DNA information in safe harbor. It seems like Item 16, biometric identifier, lists fingerprint and photos and I think voice patterns as identifiers. DNA is not listed there. Item 18 says everything else that could identify a person, so I thought DNA means, based on safe harbor, you cannot put DNA information out there.
However, the NIH issued a policy two years ago about their genomic data-sharing policy. In this policy they said before you submit your genomic information to the NIH you should strip all the 18 identifiers. I submitted my response saying that basically it says to submit an empty dataset, because DNA is an identifier, but it’s still in the policy. I think there is some confusion in the community about what exactly it means, whether DNA is part of this policy or not, and I think the community would appreciate clarification of that.
I have to say also that the way the safe harbor works is you are allowed to put the age and the state, and these are — you cannot identify people just by age and state, for sure. But in genetics, we put information as part of pedigrees. Now we have the age and the state of three generations of individuals. This will allow us to narrow down the search space to get to a particular person.
You might ask how do you know. How can you connect? What is the database you are going to contrast your data with? I can take you to Geni.com or to Wikitree, and there we can search millions of individuals in the context of pedigree information to identify them.
So, to re-identify DNA information, these are just some techniques. I can talk more and we could go for another hour about other techniques.
We see now the advent of mobile DNA sequencers. Here is a DNA sequencer connected to my computer. It’s going to be a consumer device. We are going to allow people to sequence DNA without going to labs. I gave it to my students at Columbia University, and after 20 minutes they were able to start sequencing DNA of food and also of my own saliva.
Right now in my group we push the idea that, instead of focusing on protecting the privacy of individuals, maybe we should in fact focus on building trust relationships with the community. We have a website we put together called DNA.land where we crowd-source these direct-to-consumer datasets, but now through a scientific study that is well-consented. In this consent form, we don’t promise genetic privacy but we tell individuals that they can delete their account anytime they want. I put my entire genome out there as part of the consent form to signal the community that while I am watching them they can watch my own genome.
We have this long-term relationship with the community really to facilitate trust, and this is the direction we think we should move because it’s very hard to protect the privacy of the information. Thank you.
MS. KLOSS: Thank you. Are we ready with questions?
DR. EVANS: I am Barbara Evans. I am a professor at the University of Houston Law Center. I’m a member of the Full Committee and of this Subcommittee, and I have no conflicts.
I wonder if you could enlighten us about the inter-dependency of privacy particularly with genetic data, and I don’t just mean where people know that they are family members, but just if you have a database that has a lot of people in it and some of them have consented to have their identifiers shared outside and others have not. Can you back into the identities of the people who did not consent by ruling out the identified ones? Could anyone explain how bad a problem is that, the inter-dependency?
DR. ERLICH: What I can do is I can show you my webpage in DNA.land. We have a feature where we search for relatives. We have 20,000 people already in DNA.land. We launched it six months ago. I only have a few relatives in this database, third cousins. We have many third cousins and fourth cousins out there, and if some people decide to share their information by searching, now I know that this person is a third or fourth cousin of someone else. This reduces the search space to about a few hundred individuals on average, because this is the number of third and fourth cousins that I should have. Now you know my age, my state, and this further will narrow down the search space, and then you can find me.
These DNA detectives that I described use exactly this type of process to find their birth families. They go to the database of 23andMe with 1.2 million genomes, and then they can identify some genealogical beacons and measure their genealogical distance to this beacon, to this person. They reduce the search space and then narrow down their biological family and who their mom and dad were.
DR. PHILLIPS: I am struck by some interesting differences, Jacki, between your presentation and Jeptha, yours, about the registry. I heard very clearly from you, Jacki, that whatever we recommend, hospitals are dealing with an issue where they have to choose either to shut down data sharing or to open it up with some faith (I’m not sure what that’s based on) because it’s very hard to manage data. You won’t increase the utility of it; you don’t have any audit capacity. Your contracting mechanisms are difficult to track and you don’t actually know who is touching all the data.
And then ACC with the registries has some really nice partnerships with hospitals around specific data elements, cardiology data, where you are one of those trusted vendors and are getting data and using it with some fidelity.
I’m not sure how to reconcile those, because there’s a lot more hospitals with lots of vendors who aren’t able to track or audit data trails.
How do we create trusted data agents that can help hospitals with this incredible conundrum? There may be policies that would let us do that. How do we create these trusted relationships where you can actually share data with some trust that it’s not going to come back to bite you?
MS. MONSON: One of the challenges actually with registries — I was smiling during his presentation because we have lots of registries — some are mandated by either CMS or others — which we absolutely have to participate in, and as a privacy officer sometimes I’m kicking and screaming as we sign that agreement because they don’t have good security controls around the data. They require more than limited datasets. He is not an example of one of those, but there are certainly many. It’s one of the challenges that we have. Oftentimes they don’t even want to sign our business associate agreement for use.
I think some guidance around registries would be really useful, and mandating certain privacy and security requirements would allow us to feel that they’re not only trusted but when I’m having that conversation with the patient, I can have that level of transparency and tell them, yes, we are sending your data but it’s secure, and this is how we’re maintaining your privacy within it. And at times, if they want to remove themselves, they have the opportunity to do that.
But I think it’s a growing concern that there are lots of organizations who want our data, and we want to protect our patients and protect that data. But I have many physicians who say we have to participate in these registries even if they are not mandated by some reimbursement or regulatory requirement. We do want to participate, but again, we have to have the security and privacy addressed.
Today there is no guidance, no policy that requires them to have certain security practices or privacy practices and no obligation for them to sign our agreement. So, oftentimes when we’re negotiating with them, it’s not what I would call a very good agreement where I feel comfortable with the security practices that are within it or the privacy requirements because we don’t have a choice. Especially with the mandated registries, they know that they have that flexibility because it’s a mandate, so they use that as power or leverage with the organizations, and that’s just not the right solution to, you know, the transparency with the patients and getting to the point where we can feel comfortable sharing data.
But I think your example of trusted venues for that we would definitely be open to and want to participate and share that data, and our researchers would be very happy.
DR. CURTIS: As a clinician working in a hospital system I am one of those people who goes to our privacy officer and fights those battles as to what is the return on investment versus the risk to the organization, and it can be quite heated. I would like to think that the ACC does a good job of being a steward of the privileged position that they have, particularly around the mandated registries, with a great deal of transparency as to what the security policies are around that, as well as what the data will be used for.
But I think it comes down, at the end of the day, to a certain amount of trust, and that may not be optimal but that’s probably the way that these relationships are built and maintained such that we, I think, provide a product that really is useful at the ground level so that it’s not a one-way dissemination of information; it is truly bi-directional and mutually beneficial.
MS. KLOSS: Vickie, then Helga, then Nick.
DR. MAYS: Thank you very much for presentations that had great suggestions and tried to begin to identify where some of the problems are. I appreciate it.
One of the things I would like to ask about is you talked about separating the data from the person when we talk about de-identification. It seems like in this process we need both together, so I’m trying to understand exactly what the approach would be and what it is that we really want to protect when you say keep them linked.
I guess my other one is — I think, Jacki, you talked about use cases and you would like to see some use cases similar to something that OCR has done. Could you discuss that a little more so I have a sense of that?
MS. MONSON: Sure. A comment to your first question — the idea of pseudonyms I think is great, but I can tell you that the practicality of it is that it’s going to create patient safety issues by not being able to link data together.
Before I was at Sutter Health I was at Mayo Clinic where they had a population of high profile patients where they actually used pseudonyms commonly to identify patients, and when I was there we got rid of it because it was creating so many patient safety issues. What happens if you have the same pseudonym, and how does that work if you’re sharing data across organizations? And if you have a universal ID of a pseudonym of sorts, it’s identifiable, so what is the difference between that and a patient name? I’m not sure there is a difference. That’s my perspective on that question.
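The linkage concern raised here can be sketched in a few lines. This is a minimal illustration, assuming pseudonyms are derived with a keyed hash (HMAC-SHA256); the keys, the identifier, and the `pseudonym` helper are hypothetical and are not drawn from any system described in the testimony.

```python
import hashlib
import hmac


def pseudonym(patient_id: str, key: bytes) -> str:
    """Derive a stable pseudonym from a patient identifier with HMAC-SHA256."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability only


# Hypothetical keys and identifier, for illustration only.
universal_key = b"shared-by-every-organization"
org_a_key = b"held-only-by-organization-a"
org_b_key = b"held-only-by-organization-b"
patient = "MRN-0001234"

# A universal pseudonym is identical everywhere, so it links records across
# organizations much as a name would.
universal_a = pseudonym(patient, universal_key)
universal_b = pseudonym(patient, universal_key)

# Per-organization pseudonyms differ, so records cannot be joined without
# help from whoever holds the keys.
local_a = pseudonym(patient, org_a_key)
local_b = pseudonym(patient, org_b_key)

print(universal_a == universal_b)
print(local_a == local_b)
```

The trade-off matches the testimony: the universal form supports linkage for patient safety but behaves like an identifier, while the per-organization form resists linkage and therefore cannot by itself support data sharing across organizations.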
On the second question related to use cases, there are a lot of common use cases around de-identification and de-identified data and ways that we want to share data for precision medicine, population health, innovative type things that I think would be really useful if we had scenarios and responses. You know, what is the regulator opinion on it? How does it fall within the scope of the 18 identifiers or not? And what does that look like? What do we need to do? I think that would be really useful.
He mentioned DNA and not really having any guidance on it. That would be a great example of something that guidance could be provided to be able to respond to that example so that we know in the industry we’re not constantly having to be concerned about this gray area and approaching these experts, with even fear of those experts not necessarily giving us the right advice. At the end of the day, we all want to do the right thing; we just need the guidance to be able to do that, and I think those cases would be really useful to be able to respond to that.
I think it would also be useful — We’re going to continue to innovate. It changes every single day, and it would be useful to have a mechanism by which we could continue to submit use cases for evaluation that might not be currently available, that come up as we work towards different use cases that might today not be front and center.
DR. MAYS: You actually also brought up the issue about de-identification between data and the person, so I would love to hear you comment on that.
DR. SHMATIKOV: From my perspective, I think it’s important to develop techniques that make use of the data, that explicitly account for potential leakage of personal information. I think right now there are two communities, two lines of research that basically don’t intersect. There are people who look at data de-identification, they look at static datasets, they try to look at the attributes, try to decide which attributes are identifying and which are not and do something with identifying attributes. By the way, I think that’s kind of a losing game. That’s not going to happen because there are external sources of information. Like I said, every attribute becomes identifying at some point. But this is the de-identification part of it.
And then there’s another huge area where people look at how to make use of the data. They develop techniques for understanding correlations, using this data for precision medicine and also for computing general statistics and making some biological use of it, research studies. These two lines of inquiry, if you will, seem to be completely separated from each other, and that is not how it ought to be.
People who design techniques for understanding the data need to think about privacy and protecting individuals from the beginning, and that has to be part of the technology that they use. At the same time, people who work on protecting the data need to think about the uses of the data; otherwise, they are going to end up with datasets that are not really useful.
So there needs to be more conversation between these two communities and some kind of integrated approach that takes both aspects, quality of the outcomes and extracting useful intelligence from the data and safety of the patients from whom the data comes. They need to be considered holistically, and systems and algorithms need to be designed so there is some balance between the two.
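Dr. Shmatikov's remark that every attribute becomes identifying at some point can be illustrated with a small uniqueness count. This is a sketch on invented records, not a method described by any panelist; the field names and the `unique_fraction` helper are assumptions made for the example.

```python
from collections import Counter

# Hypothetical records with direct identifiers already removed.
records = [
    {"zip3": "021", "birth_year": 1960, "sex": "F", "diagnosis": "I21"},
    {"zip3": "021", "birth_year": 1960, "sex": "F", "diagnosis": "E11"},
    {"zip3": "021", "birth_year": 1975, "sex": "M", "diagnosis": "I21"},
    {"zip3": "946", "birth_year": 1975, "sex": "M", "diagnosis": "J45"},
    {"zip3": "946", "birth_year": 1982, "sex": "F", "diagnosis": "E11"},
    {"zip3": "946", "birth_year": 1982, "sex": "F", "diagnosis": "E11"},
]


def unique_fraction(rows, attributes):
    """Fraction of rows whose values on `attributes` match no other row."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


# Each added attribute can only keep or raise the share of unique rows.
for attrs in (["sex"],
              ["zip3", "sex"],
              ["zip3", "birth_year", "sex"],
              ["zip3", "birth_year", "sex", "diagnosis"]):
    print(attrs, round(unique_fraction(records, attrs), 2))
```

In this toy table no record is unique on sex alone, a third are unique on ZIP prefix plus sex, and two thirds are unique once the diagnosis is included, which is the sense in which an apparently harmless attribute can become identifying when combined with outside information.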
DR. RIPPEN: I want to thank all the speakers for very informative, very interesting concepts. All touch on each other in very different ways, so I want to thank you.
I guess I would like to highlight something that I always find important from a bias perspective which is the question of transparency. Transparency actually enables us all a lot of times to step up with responsibility and knowing that someone might be able to see something or understand what we’re doing, which relates to provenance as far as who has done what to what information.
So, when I hear everyone talk about everybody has data, is collecting more data, is adding more data, the question of re-identification and de-identification becomes more and more important. Even questions about who owns the data ultimately, in what form, the decisions people make about when to release de-identified data or not and to whom and for what purposes, because we know that then HIPAA is no longer relevant. So a lot of really interesting concepts all coming together.
I would like to ask each of you to think about what would the implications be if we enabled transparency by requiring provenance? Every time data gets exchanged for research or whatever, that there is some sort of a record as it relates to what was done to the data, and that includes de-identification and addition of other data, what are the implications of that, and what do you think might happen, both good and bad?
DR. CAPPS: I do know that when we tried to get data for CDC, we couldn’t get it because hospitals wouldn’t deal with us unless we had a business agreement. Even though HIPAA explicitly says CDC has a right to the data, the reality is all the hospitals basically are used to doing so many business agreements that they won’t share data without a business agreement. We tried to explain that any contract the federal government signed with them would be unenforceable. So there is a whole culture of business agreements.
I also went to a conference where there was a company in California whose whole business plan was to be a business associate with all the hospitals around, and then they would sell access to the data to places like Safeway so that Safeway’s pharmaceutical people could target people individually about what kind of diet they should have and what kind of medical use.
I can only talk about provenance from the statistical agency perspective. I think it would be interesting to see exactly where the business agreements are going and how many people are using them, because I have no idea whether provenance would work, or some kind of public index where I log in and see how my records have been used, or some way that watchdog groups could come in and look at aggregate uses of data and see how it has been used. That might be possible. It may be too onerous because these business agreements I think are somewhat non-transparent, and I don’t know exactly where they have been used.
DR. RIPPEN: I guess, just to clarify, there’s provenance of the business agreements but also kind of the tag to the data, also.
DR. CAPPS: Right.
DR. SHMATIKOV: Let me just add one quick thing here. Transparency is a wonderful thing, and in all kinds of data use, whether biomedical or not, we need it. But there is a huge technological problem here.
Tracking data — there is such a mish-mash of systems out there, and as data moves from system to system and tagging it in a way that survives transition of data from one place to another, that’s a problem that basically has not been solved technologically. I’m sure people who work in big hospital systems probably know how painful it is, and I can only imagine how awful it is when there are multiple databases and multiple systems that use data from these databases and data flows from place to place in unexpected ways.
DR. CAPPS: Again, working with the different EHR companies, there’s a large incentive for these systems not to work together. Each of the vendors wants you to buy only their equipment, and they only want their stuff to talk to each other. There’s a conscious effort not to have the stuff inter-operate.
The reality is — we have actually talked about reaching out to — I’m not going to name the companies, but some of these companies, and thankfully somebody — maybe it was this committee, maybe it was somebody else — pushed Meaningful Use through, so that there are some incentives to actually have some of the data begin to become interoperable. It’s not something the hospitals have any control over. It’s something that seven big EHR vendors have.
I think there are things that you can do or the industry can do to move toward figuring out inter-operability. If you want things to connect, it’s the protocols. You can make any kind of electrical instrument you want as long as it uses 110. As long as the connectivity works and you force that connectivity, you can force systems to become standard, because if they want to play they have to be able to communicate.
DR. ERLICH: I totally agree with what Vitaly said. When you move data it’s very hard to trace what is going on. But we also see the opposite trend. We see moving computing to the data. We see that there is like a centralized commons on the cloud and then researchers, basically for free or they pay, process the data on this commons without moving large datasets.
I think in these cases it’s much easier to have transparency and log the commands and the API calls the researchers run on the data.
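The idea of logging calls made against data that stays in place can be sketched as follows. This is a minimal illustration under stated assumptions: the `audited` decorator, the `mean_age` query, the in-memory `audit_log`, and the sample ages are all hypothetical, and a real commons would write to append-only storage and authenticate the caller.

```python
import functools
import json
from datetime import datetime, timezone

audit_log = []  # stand-in for append-only storage


def audited(func):
    """Record who called which query, with what arguments, before running it."""
    @functools.wraps(func)
    def wrapper(researcher, *args, **kwargs):
        audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "researcher": researcher,
            "call": func.__name__,
            "args": json.dumps([args, kwargs], sort_keys=True, default=str),
        })
        return func(researcher, *args, **kwargs)
    return wrapper


# Hypothetical dataset held in the commons; rows are never returned directly.
_ages = [34, 51, 47, 62, 29]


@audited
def mean_age(researcher, min_age=0):
    """Aggregate query: mean age of participants at or above `min_age`."""
    selected = [a for a in _ages if a >= min_age]
    return sum(selected) / len(selected)


result = mean_age("researcher-17", min_age=40)
print(result)
print(audit_log[0]["call"], audit_log[0]["args"])
```

Because every query passes through one entry point, the record of use is produced as a side effect of the analysis, which is what makes this setting easier to audit than tracking copies of a dataset after they leave.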
MS. MONSON: I think conceptually it’s a great idea. I think in reality it is not going to work, and I can give you an example of why. The disclosure regulations under HIPAA at one time had a right of access that is currently being evaluated, and at both organizations that I’ve worked at, Mayo Clinic and Sutter Health, we ran audits on non-complex patients just in the designated record set, so those systems that were used for healthcare decision-making or billing purposes, and it was 17,000 pages and it wasn’t readable, wasn’t legible. That’s one of the challenges.
Sutter Health has one of the more sophisticated privacy audit programs where we run audits on our records and look for inappropriate access, and we constantly are challenged with our vendors who don’t have good audit logs and are battling with them to ask for good audit logs. We have multiple systems, over 2,000 systems that fall in that designated record set. As technology evolves, obviously, we will have fewer of those systems, but to try to track all of that in those systems is just not possible in the way that we would love to do that.
For research we do keep track, but it’s very manual and cumbersome. So I think in concept it’s a great idea, but the technology just hasn’t caught up with where we want to go in keeping track of it. I think, instead, we have to look at transparency upfront of how we would use their data so that we can share it in advance versus being able to say these are the exact ways that we have used your data. We just don’t have the mechanism to track.
DR. RIPPEN: I’ll just add a little aside. Technology is an interesting thing and it changes dramatically, especially based on what the approach is, like analytics coming to datasets. Also, there are a lot of other industries that have to track a lot of things around the world.
Again, I just throw out that there are some interesting things that might be possible; it’s just a question of what is the best thing to do that allows us to have the balance. Thank you very much.
MS. MONSON: I can tell you that the patients would love that data because they ask for it every day.
MR. COUSSOULE: I think we tend to get bound up a little bit in some of these discussions because they’re very broad. We tend to think about a one-size-fits-all kind of model in some of this. There are two questions I have or points to make that I would like you to explore a little bit.
One is there’s a distinction between what I call R&D to figure out best case, long-term things, and very directed things to an individual person or patient or something like that. But you’re thinking about both of them from a health system perspective. Most of the rest of us are not thinking about how do I help Joe Smith right now, but more about how do I help people like this with this kind of thing. So there’s a little bit of a distinction between those that I’d like to explore a little bit.
And the second thing is I tend to think a little more about the intended use kind of questions, not necessarily the technology components but what’s the intended use, and do we think about policy in regards to the intended use question as opposed to the data question. I’d like you to just weigh in on that a little bit if you could, both of those things.
DR. CAPPS: I appreciate the issue that you talked about — for example, pseudo-IDs. If you’re in the hospital and somebody is looking at your arm band, they have obviously a scanner number. They are not using your name to identify you; they use a scanner number. Then they ask you questions, PII questions, to validate that your scanner ID is correct. That does talk to safety and it does talk to being able to ask somebody about their genetics, for example.
But if I’m doing research, I’m going across sets of things. I don’t necessarily need to have that data directly connected because I’m not really asking questions of your name. I’m asking questions of your heart attack, or of your predisposition to something.
We’ve talked about the idea of provenance and said, okay, what would be the intent of the use. My problem with that is we don’t even have a map; we don’t have a population of uses. And the other problem is new uses are coming up all the time. Whether we come up with a registry or index of uses and could evolve that — I agree with you. Anything that we build in technology that becomes static, becomes brittle like glass, will be broken very quickly.
If we don’t build in change so that technology evolves gracefully — I don’t want to have to go around my elbow to fix something that was done 10 years ago. I want to be able to make it fit in with something. So we have to design our technology so it evolves organically, gracefully, to do things that we didn’t plan.
To do that often means let’s think broadly and say the kinds of research, the kinds of tech we want to use. Maybe we’re not using it now, but in the foundations. Let’s don’t tear down the building and dig up the foundation again every time we want to put a door in. Let’s build the infrastructure in at the foundation level to handle intentions that we may not have today.
The second thing is I think we need to really think broadly about the kind of intentions that we may have, particularly when you’re talking about new uses like precision medicine. What if all of a sudden we have genetics and you’re going to the doctor and he’s pulling up your genetics and saying, okay, I crossed and there are three million people — let’s say 100 million people — we know about who have the same kind of thing. These are the kinds of things we’re looking for.
So it may be that we’re looking at the cloud. Eventually, the cloud and the local machine may be irrelevant. You run the same software whether you’re running it at your actual site or you’re dialing in to a private cloud somewhere. The distinction between the cloud and a private cloud is going to largely evaporate, and the connectivity between different clouds is going to be connected. I can see that it won’t be just your 700 systems anymore. Your 700 systems may not be worth anything. They may be obsolete if you don’t connect to these other clouds of systems so that you can get information from them as well.
So these issues of de-identification or privacy — information traveling privately may be what you’re looking to do, and being able to connect the privacy information just in time at the right time may be what you’re trying to do. Again, that’s a broader thing, but the distinction is that what you have the ability to do right now is to start influencing the incentives for building — these technologies are here.
The question is are we implementing them, and the answer is no. And the reason why is there are too many reasons not to. There’s more money to be made. If I was in business and I had 80 percent of the market or 30 percent of the market, I would not have my equipment talking to all the other equipment, either; I would be talking to my own systems.
So, how do you influence interoperable standards so that things can be plugged together? How do you build incentives into business so that they want to build things that interoperate, so there’s more money in talking together than not talking together?
DR. ERLICH: In genetics, the division between research and data with clinical implications is getting more blurred and more mixed. When my colleague and I work on cases of Mendelian disorders, we want to publish the results as a basic research and exchange information with other researchers, but there are also implications of the data. The families want to see the data because it can help them and their future kids.
When we look at complex traits, sometimes we do a study on heart disease and then we have incidental findings. Suddenly we have people who are carriers of a BRCA1 mutation. The recommendation of the American College of Medical Genetics and Genomics is to return these results to these individuals. Although the study was about heart disease, now we should report about their cancer status. So it’s getting more blurred. I don’t think in genetics we have such a strong distinction between just R&D and the clinical care.
MS. MONSON: I think it really has to be a layered approach. I think part of the challenge with the de-identification standard now is it is as black and white as it can be with some gray area, and I think it needs to be more fluid than that to address innovation and be layered. How is this going to benefit our patients? And then have an evaluation of a set of standards based on that, and then a decision about really whether it needs to be de-identified or not.
I think if you do that with transparency, you are going to achieve what we’re looking to achieve, which is being able to share the data and use it at the end of the day to benefit all of us in the room and everybody else with the data being able to be used for medical research and other things of that nature.
DR. CURTIS: I think there is also a larger conversation that goes well outside of this room, which is what does society want, what does society value, and how much heterogeneity of desire can we accommodate when we’re still trying to respect the privacy of the individuals while pushing forward the science and medicine and healthcare. They are both incredibly important and, as we talked about, are going to be running into each other in an increasing fashion.
I think communicating, doing a better job of communicating to patients on a regular basis what we’re doing with this data, why we are doing it and what can be achieved with it benefits not only the individual but society as a whole, and is kind of a discussion that doesn’t really take place. It certainly doesn’t take place at the bedside when I’m with patients but I think really is one that is necessary moving forward.
I do like the concept of the opt-in and opt-out, approaching tiered approaches to how patients might try to control how their data is being used by external parties. I think that certainly would resonate with me. I only speak for one person in this case. How you execute that is a massive challenge, but I certainly think that patients deserve to know what’s going on with their data.
MS. MILAM: I’m wondering if you all could speak to navigating all of the different standards of de-identification that may be imposed on a dataset. Of course, we have HIPAA but we could also have Part 2. We could have state law with other de-ID standards.
And layered with that, when we think of a record, a patient may present at the hospital for a heart attack but in position 15 of the record you could have a mental health diagnosis, so when does it become a substance abuse record? When does it become mental health, HIV, STD or anything else? Is there any approach other than going to the most stringent standard? Are there ways to slice and dice this? Are there any best practices when you have multiple datasets impacting the data in different ways?
MS. MONSON: From a provider perspective, it’s extremely challenging. Sutter Health is located in California, which is probably one of the most heavily regulated states as it relates to privacy, and we are constantly evaluating HIPAA, state law and how does this work. It frustrates our researchers because oftentimes it’s hours of conversations and evaluations. I have to know where they’re going to go, and sometimes they don’t know at the beginning of the research study exactly where it’s going to land at the end because it’s going to depend on the value of where the data is and what they might find.
So we are constantly having to evaluate and it’s challenging. Oftentimes, they leave frustrated and say this data is not going to be usable by the time we get done removing all of the requirements.
I mentioned earlier the HIV status. There’s a lot of diagnoses related to mental health, substance abuse and sensitive diagnoses that have certain restrictions around them that provide real challenges for research purposes because it’s not practical for us to get authorizations from 5,000 individuals where we just want to use their HIV status. But, in reality, that is a regulatory requirement so, in fact, we have to do that and we do it. But oftentimes that will steer them away from doing research to include that status because of those challenges.
We definitely go to the most stringent regulation, but I can tell you that it’s not practical and I would love evaluation not just at the federal level but at the state level to see how they challenge each other and what that looks like, because I think it just becomes more challenging. California is constantly — has 20 to 30 privacy laws that are on evaluation in any one year, and all that does is make the data more challenging to be able to use.
We obviously, as I mentioned, want to be transparent but it’s impossible to get written authorization for every single one of those special diagnoses and other challenges. Oftentimes when I’m evaluating, HIPAA is the least stringent of all the requirements.
DR. SHMATIKOV: Let me now make a hard problem even harder. Now suppose the patient shows up and in addition to all of this explicit stuff you talked about, like maybe they have a heart condition but there’s some indication of a special protected category like mental health, suppose there is also genetic information in the record, which is increasingly going to be the case.
Now you don’t know what’s sensitive in it. We know some things about human genetics but very little yet. As time goes on we’ll know more and more. And genetic information is immutable; it’s going to stay with that person for the remainder of their life, and it also impacts their relatives, and we just don’t know what’s sensitive in it.
Maybe there’s going to be some scientific discovery in a few years that now, looking back at this genetic information, we can see all kinds of information that we don’t know now and we realize that, oh, that should have been protected because it indicates something.
You’re not going to solve that, because we have information where we don’t know its value, we don’t know what’s sensitive about it, but it has implications over decades for the rest of the person’s life. It impacts their relatives who are not even part of that record or not in the hospital at the moment. I don’t have an answer.
DR. ERLICH: And to add to that, and maybe it is outside of the mandate of this committee, but it seems that regulating protection from harm was much more useful for genetic information, such as the success of GINA. Last year we saw that even when a company tested 13 different markers that were totally benign, not related to health, still they faced severe penalties of $2.25 million for testing their employees, which is prohibited by GINA.
So I think that regulating or protecting these harms may be a valuable path forward, and not the privacy, which is increasingly harder and harder.
DR. CURTIS: Just to answer your question earlier, I think, yes, the varied regulations that people face are barriers to moving forward with this type of inquiry, and it would be great to have a single consistent standard nationwide. I don’t necessarily see that happening anytime in the near future.
MS. SANCHES: Thank you for all your comments. This has been a very interesting morning.
The safe harbor method really doesn’t take into account the expected uses; that’s why it’s called safe harbor. The expectation at the time of crafting the rule was that the expert determination method would probably be the preferred method for doing longitudinal analysis or any cases where you really did want to vary the data elements that you were including in your dataset to support your research.
Other than providing use cases, which I think is a very interesting idea for guidance, what could be done to make the expert determination method a more viable approach?
MS. MONSON: From my perspective, the idea of technology that could do that, which would actually meet the regulatory requirement or have an endorsement from the regulators would be really useful because from a cost standpoint it’s probably the most cost-effective mechanism, assuming that technology is affordable.
The experts — it’s just really expensive to have them evaluate it. Oftentimes we can get them to evaluate it and get to a point where we can use the data the way the researchers would like to, but it costs upwards of $200,000 for evaluation and sometimes takes three to six months. It gets costly, and when the researchers are evaluating what their research budget is, sometimes they’ll drop the evaluation of the research because it just isn’t practical to do that.
As far as I know, there are like five experts in the nation that do this that we would actually trust, which is not enough, which means supply and demand. They can charge more based on the fact that they know we need them to provide useful data. I’m sure I am not the only organization that’s doing lots of research that runs into these challenges.
It’s a great idea in concept. I think just the practicality of it is really expensive. I really like the idea of technology where you could dump the data in it and it will achieve the same thing as that expert.
DR. ERLICH: If I could comment on the expert model, there are cases — again, going back to genetics — where the experts just don’t know enough because it’s such a new field. When the genome of Jim Watson was published, he decided he doesn’t want to disclose his apo E status because of risk of Alzheimer’s. So, working with Jim Watson — he’s an expert in DNA, obviously; he works with geneticists at Baylor — they decided to just take the part around his apo E gene and redact it from the data. A year later, a group from Australia showed that they can impute back this part using genetic imputation.
So here’s a method that is not related directly to the expert provision in HIPAA, but this field is moving so fast it’s hard to be an expert and have some statement that will stay for a long time.
DR. MAYS: I just want to make sure I understand something in terms of the privacy issues relative to machine learning and algorithms. I have negotiated contracts for my own and I think I understand it, but there may be other things that I don’t know about it.
What is your concern specifically about machine learning, the data for machine learning?
DR. SHMATIKOV: Again, this is a very new area of research, so a lot of things I can talk about are not yet — they are public knowledge but people don’t appreciate their significance yet.
For example, it’s very easy with existing machine learning algorithms and machine learning platforms to use data, create a model, and then somebody can take the model and figure out what the training data was that was used to create that model. So the model leaks the data. Oops!
It’s avoidable if you do learning right. I don’t want to go into technical details there, but that’s a very specific thing that could definitely happen.
There are more subtle issues. Unless you do machine learning on the data carefully you can easily end up with things like disparate impact — different categories of people are affected differently and the algorithms end up being biased. That’s a huge issue that a lot of people are looking at. Transparency would help there because it could show how different attributes are used in the algorithm.
I could probably talk about this for an hour which I am not going to do. But this is a very poorly understood area, and we are just now starting to understand the risks that come from just blindly throwing machine learning algorithms at the data and just using the models as they come.
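The point that a model can leak its training data can be shown with a deliberately simple toy. This is not one of the attacks Dr. Shmatikov has in mind, which target real machine learning platforms; it assumes a nearest-neighbor model that memorizes its training rows and exposes a confidence score, and every name and number below is invented for illustration.

```python
import math


class NearestNeighborModel:
    """Toy model that memorizes its training rows (an overfit extreme)."""

    def __init__(self, rows, labels):
        self._rows = list(rows)
        self._labels = list(labels)

    def predict_with_confidence(self, x):
        """Return the nearest row's label and a confidence of 1 / (1 + distance)."""
        best_label, best_dist = None, math.inf
        for row, label in zip(self._rows, self._labels):
            d = math.dist(row, x)
            if d < best_dist:
                best_label, best_dist = label, d
        return best_label, 1.0 / (1.0 + best_dist)


# Hypothetical training data: (age, systolic blood pressure) -> risk label.
training_rows = [(54.0, 130.0), (61.0, 145.0), (37.0, 118.0)]
training_labels = ["high", "high", "low"]
model = NearestNeighborModel(training_rows, training_labels)


def looks_like_training_member(model, candidate, threshold=0.999):
    """Guess membership from the confidence alone, without seeing the data."""
    _, confidence = model.predict_with_confidence(candidate)
    return confidence >= threshold


member = looks_like_training_member(model, (61.0, 145.0))
non_member = looks_like_training_member(model, (60.0, 140.0))
print(member, non_member)
```

Someone who holds a candidate record and can only query the model still learns whether that person was in the training set. Training methods that limit how much any single record influences the output are one way of "doing learning right," though the testimony does not say which techniques were meant.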
DR. MAYS: One of the things you’re doing is encouraging people to put their genetic data up and then say, okay, at any point you should feel free to remove it. What I’m trying to understand is how much education do people need to understand truly what the consequences might be?
I think what happens is — and I’m particularly concerned about some of these vulnerable populations — they think they’re getting a lot, and then when they see the consequences — and for people who are lower socioeconomic the consequences are greater to them, and those are often the data that we seek to use for things. Then we have people who have great mistrust of the healthcare system.
So I’m trying to understand what should they be told in order to know what is really an unintended consequence maybe of their data being that available?
DR. ERLICH: We have several layers to encourage people, the participation of people. The first thing is reciprocation. If you come and bring your genome to DNA.land, we will do some fun analyses, or useful analyses, for a genome. We’ll tell you about your ancestry, search for relatives. You have also to opt in to see this information so it’s not that we’re pushing this information to you.
Our consent form is extremely simple. It’s 1500 words, plain English, annotated, and we don’t try to consent for — It’s only a one-time consent and that’s it. It just sets the basic principles of DNA.land. As they use the website and present more reports, more information, we ask them again — now we are going to ask you for more information. Do you want that or not? And we present also some content on what the meaning is of providing this information. So we don’t try to push all this information at once and then it’s not a consent.
We have a person in my group who comes three times a week just to answer emails of people, go to our Facebook page, address their concerns, and we try to engage them in these conversations. I don’t tell you what will be always the consequences — even me as a geneticist, I really don’t know.
I think as a society, we have this duty that if we follow principles that eliminate discrimination, eliminate racism, then we can encourage people to share their data. Being myself of Jewish heritage, I wouldn’t put my genome online 70 years ago if I lived in Europe, but I do it today in the United States because I feel that in this society I am much more safe. I think this is our duty in general as a society to encourage these principles.
I just want to touch on the machine learning aspect that you asked. We have sometimes some surprising results from machine learning. My colleagues at MIT developed an approach where they can take a movie from a regular cell phone, stable camera and good illumination, and see the heart rate of a person just by machine learning approach, analyzing the video. And here we expose health information, something that most people just don’t know. With a regular camera and just taking a video, right there I can see your heart rate.
So this is like some of the surprises that we see, and also there are many useful things to that but also some maybe unintended consequences. Might be creepy.
MS. MONSON: I think the other thing that’s beneficial is patient impact, to share with them the benefits.
I have found that the easiest way to do that is to actually share a patient story with the patient who I’m talking to and explain it that way. They are much more comfortable, and it also then provides them the benefits of opting in or authorizing particular uses. Otherwise, they think about it from the media, the news, which is focused on privacy and how the government and everybody is using their data all over the place without their permission, and that’s a misperception. The only way you can overcome that is to share the impact of how it could benefit them.
DR. EVANS: The question of privacy harms has come up, and could you help us understand is that something that can be a useful regulatory criterion? Is it sufficiently measurable that you can say this policy is superior to that one because of these? And can it be communicated as part of this transparency to the public? Could you just tell us a little more about whether that’s a vague concept or one that actually has utility here?
DR. CAPPS: Actually, I am asking that research be done on trying to find a way to define and quantify harms. In the area of privacy, I go through lots of anecdotes about how I can violate your privacy, and I’ve talked to people at LinkedIn and Facebook that said you don’t have privacy; you’re not going to have it; get over it. At Census we’re concerned about privacy no matter what.
But in the days when we can say we can de-identify a dataset, and if you can re-identify anybody in the dataset at all out of 300,000 people the dataset is broken, that maybe was something in the past.
Today, we’re talking about either benefits for patients or we’re talking about prospective harms. We can all give use cases, anecdotes, but there isn’t any index or anything that says these are the kinds of things. Do I say that I want my family — When I communicate with somebody, I’m going to want to have their PHI, whether they’re in a hospital or you’re giving genetic counseling or anything else. You’re going to have to link that PHI to it. But if you’re doing research you don’t need it.
So the question is can you find practical ways of separating it, and can we have ways of indexing or having people talk about harm and then judge it over time? Because new harms are going to happen, and we need not just a list of 18 identifiers; we need a way of judging this and saying these are more important and these are less important based upon some criteria. And right now, we don’t have that criteria. We don’t even have a list.
So I’m calling for research if we can get to that, because I think in the world of privacy that’s going to be something that we’re going to need.
MS. MONSON: I think harm is quantifiable, and I think the research community already does that. Before something goes through an institutional review or IRB or review process, you identify as part of that what harm could result from it. So I do think there are ways that we can today quantify the risks and possible harm to patients based on the different types of research that we’re doing. I don’t think it’s going to be a one-size-fits-all, but I do think that today we can quantify that and, in fact, do quantify it for other evaluations as it relates to research and use of data.
DR. CURTIS: I kind of disagree with that. I think it’s hard to even quantify risk to a certain extent, but I don’t think you can quantify harm because it’s such a subjective term. What might be harmful to an individual might not be perceived as a slight by another, so I think it’s hard for a regulatory agency to take that approach. I think that outlining guiding principles is probably the extent of it. But I think quantifying risk, yes, is doable.
DR. RIPPEN: I want to actually expand on that. If what we’re talking about is not necessarily IRB-approved uses of information, especially once de-identified and no longer under any kind of or minimal protections — and now we think about determinants of health — If you looked at the IOM as far as what they listed out, if you want to talk about really sensitive information about exposure to alcohol and drug abuse within your family, violence, behavioral things that really shape an individual and how they may or may not respond. If you think about even what we may include in our electronic health records of rape or incest or even if you were a kid and you got thrown in jail because you were experimenting, and then what that means to an individual because certain societies view things differently, what is shameful and what is not.
So, when we start exploring kind of risk, this concept of harm, we have to ask what are the implications of even capturing everything in an electronic health record, and are we changing what we’re sharing or not. I remember treating a patient in the ED and it took him until the end before he shared something very, very private, which was the reason why he was there because he had to get to the point of trust.
Then you think about, well, now that data is being made available — I mean not for bad but for good, for improving health, but the intent is information that is electronic can flow. And what does that really mean? And then we de-identify and then people can actually pull it back together.
So I think we just have to be wise with regards to thinking about what is it that we’re really trying to do here with regards to de-identification and re-identification, going back to the question of transparency and intent that we talked about.
So I just wanted to reiterate this notion that you highlighted with harm. Harm is something important. And we all know in healthcare our role is to improve health and to work with individuals about something that is very personal and private. I just want to reinforce that.
MS. MILAM: I am still interested in exploring the harm discussion and its quantification. Prior to what I think of as still the newly revised breach rule, earlier, before we had the new rule, we would get into discussions of whether one patient would view something as a slight or not. But then, when the breach rule changed and there’s essentially a presumption of harm, and for there not to be harm you have to prove that there is a low risk, it really takes out a lot of discretion.
There are four factors in the rule and we have created a risk assessment tool where we assign numerical values, and we have buckets of different values associated with each of the four areas and you really don’t get into individual perception of harm. It’s more fact-based. So I’m wondering if we could use that sort of risk of harm analysis to apply to de-identification and maybe get some traction that way.
MS. MONSON: I think you could. I’m very familiar with the breach notification and being able to assess and evaluate that, both the new one and the old one where we had conversations and evaluations of that, and I do think that’s an easy way to evaluate, taking out the perception of individuals and instead focusing on the fact-based. We found that that breach analysis is actually pretty useful in that evaluation. Unfortunately, it did not change how many patients were notified. Oftentimes, we still notify patients even if there’s no harm because we feel it’s the right thing to do. So I definitely think that is a way or mechanism that you could evaluate in this way.
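[Editor's illustration, not part of the testimony.] The fact-based scoring Ms. Milam describes can be sketched roughly as below. The four factor names follow the risk assessment in the HIPAA breach notification rule; the low/medium/high buckets, their numeric values, and the tier cut-offs are hypothetical and are not taken from any witness's actual tool.

```python
# The four factors of the breach risk assessment (45 CFR 164.402).
FACTORS = (
    "nature_and_extent_of_phi",      # types of identifiers, likelihood of re-identification
    "unauthorized_recipient",        # who used the PHI or received the disclosure
    "actually_acquired_or_viewed",   # whether the PHI was actually acquired or viewed
    "extent_of_mitigation",          # how far the risk has been mitigated
)

# Hypothetical numeric buckets; a real tool would define and justify its own.
BUCKETS = {"low": 1, "medium": 2, "high": 3}


def risk_score(ratings):
    """Sum the bucket values over all four factors (range 4 to 12)."""
    missing = [f for f in FACTORS if f not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return sum(BUCKETS[ratings[f]] for f in FACTORS)


def risk_tier(score, low_max=6, medium_max=9):
    """Map a score to a tier using hypothetical cut-offs."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    return "high"


if __name__ == "__main__":
    example = {
        "nature_and_extent_of_phi": "high",
        "unauthorized_recipient": "low",
        "actually_acquired_or_viewed": "low",
        "extent_of_mitigation": "medium",
    }
    score = risk_score(example)
    print(score, risk_tier(score))
```

The point of the design is the one made in the discussion: every input is a rating of an observable fact about the event, so no estimate of how an individual patient would perceive the harm is needed.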
DR. ERLICH: I think we just don’t understand enough the consequences of harm, and it’s very hard to predict. An empirical example is the Ashley Madison leakage — 36 million records of the most intimate information someone could imagine — infidelity, sexual preferences — were leaked online. There was some discussion a year ago and I thought the skies were going to fall, really. I thought this was going to change the Internet. There was an op ed in The New York Times about how we’re going to see a spike in divorce rates, how we should actually all buy small apartments because there would be a higher demand for small apartments.
And now, a year later, yes, there are some sad stories, but after such a strong exposure I don’t think the skies fell. Now, there were some major consequences, but the skies did not fall, even though this is probably one of the most sensitive datasets you can imagine that leaked, and many people are identifiable in this dataset.
So I think we still don’t understand exactly the psychology of harm and what exactly it means to people.
MS. MILAM: Do we need to separate harm into reputational harm or financial identity theft to make a clearer path?
DR. CAPPS: One thing that’s interesting with the whole privacy thing is that one day, people are sharing everything that’s most intimate on Facebook. The next day, somebody is sharing their genetics through Ancestry.com. The next day, there’s a newspaper article about identity theft and response rates to government surveys go down and it costs us $3 million extra to do the same survey. So people are responding to newspaper articles.
This is not a rational response to privacy; this is whatever we’re thinking about now we give extra focus to, and we over-respond. So you can see the response rates. Census doesn’t like to talk about privacy. We don’t like privacy to be discussed because it costs you a lot more money to get data when we do that. We’re very sensitive to it.
For me to talk about privacy harm as a non-subjective issue seems impractical. It seems to be subjective. Different people consider things to be — After 9/11 we were expected to give up information, small tabulations on where all the Arabs lived in the United States. Census shut that down; they just didn’t get those tabs. We actually made our tabs more gross. We actually cut back the amount of tabs we did.
But this is a moving target, and part of it, it seems to me, is we need a way to capture the signature of what people consider harmful. At the end of the day, what we’re talking about with provenance is trust. If we want to be effective in what our mission is we need trust, and for us privacy is a tool to achieve that trust.
Now, if we can come up with a regulation that people have trust in, maybe that suffices for what we need. Maybe if we have a way to communicate what that is, and maybe if we have a way to deal with these unexpected issues in the future, that’s useful.
I also remember working with a guy in South Carolina who was in the finance department and collected and linked all the records for mental healthcare. He had support from everybody because he linked together the crime records to the mental health records, and he had police support because they were basically planning the number of jail cells they needed by the number of kids who didn’t read by third grade. They knew the number of kids who weren’t going to read by third grade by the mental health problems and the criminal problems they had in the family.
Now, we need to connect those dots to be effective, and that violates privacy. So, part of it is we have to learn how to deal with connecting dots and being able to do this research at the same time we’re dealing with individuals. And when we deal with an individual they’re going to have PII or PHI associated with them. It’s not simple.
MS. MONSON: I agree with the trust comment. I think related to financial or reputational, they don’t know the difference. I’ll give you an easy example of why.
We’ll send a patient a notification letter that their information was inappropriately accessed, and I’ll get a call from a patient who says can you please provide me credit monitoring. I have tried, although I have stopped trying, to explain to them why it would have nothing to do with their credit and why credit monitoring would not be useful to them. However, they always, 100 percent of the time, disagree with me, so I stopped making that argument and instead just provide them the credit monitoring if that’s going to make them feel better.
But they don’t really understand, and mostly it’s because they’re influenced by the media and what the media is talking about.
MS. KLOSS: We are right on schedule. Thank you so much. Walter, did you have any questions?
DR. SUAREZ: I just want to say it has been a fascinating panel, thank you so much for all the testimony. I think there has been a lot of discussion about methodologies, techniques, ways in which protections can be added to the data to restrict or control the risk of re-identification.
I was intrigued by one of the concepts from testimony from the U.S. Census, which is a digital marker that can in some way either control or restrict or even alert of possible re-identification attempts. I wanted to see if there could be a little bit more explanation about it, if there is really some technical capability through tagging. Some people talked about some provenance elements and metadata access or even digitally signing or digitally marking data, which can indeed help either restrict or prevent re-identification when that is supposed to be restricted.
Is there such a technology truly available?
DR. CAPPS: This is experimental. I am sorry to say we don’t have software in place that we can deal with, but it’s something that we’re exploring. Think of it this way.
You don’t tag the individual records; you tag the dataset. Then what you do is effectively work in the logs, because now we have machine learning. The advantage is that once you have a log of all that, and let’s say you have a training set of some of your experts and non-experts trying to re-identify data, you’re going to see the types of operations they do on the data. Once you have the log, you now have machine learning capability to basically identify attempts. Now they may not actually be purposeful, but we can now look at them.
Part of what we’re doing now is doing things ad hoc and we’re doing them by paper and doing them by human process and they are inconsistent. We don’t have competent logging. But if we built some very simple systems — What would be nice is if we could build some open source systems that become prototype systems, and then companies like Epic, et cetera, could basically incorporate them because it would be a marketing advantage so that the different EHRs might actually build that into the system. Then, all of a sudden, you have technology that you can use without having to build your own.
At least that seems like it might be a very useful approach.
MS. KLOSS: How realistic is it to be able to get to this layered or tiered approach thinking? Is anyone doing work on that? We discussed it yesterday and it came up again today, and it seems like, yes, it would be helpful not to have a black and white, one only approach, but some tiering. Could we comment on that?
MS. MONSON: I think I commented on the tiered approach. I am not sure that anybody has more than an idea or concept of evaluation of it in a layered approach where we’re not just focused on the expert, we’re not just focused on the 18 identifiers, but we’re evaluating based on facts and benefits of those datasets and, based on that, then having a criterion by which you determine what kind of a dataset and what it would look like. So, not just focusing on de-identification. This is what I would suggest doing, which still provides a little bit of a black and white approach to it but provides more flexibility for innovation and technology to continuously change, which we know it’s going to do.
MS. KLOSS: And it could relate to the use case discussion we had earlier.
MS. MONSON: Yes. And I think you would want to evaluate using use cases to come to that criteria.
MS. KLOSS: And based on state-of-the-art today, this would be a research area. It’s not anything that could be operationalized, but it would be a fruitful area for research. Would that be a fair conclusion?
DR. ERLICH: I think in terms of research there is some level that we know how much each identifier contributes to identification and we can quantify that using entropy, for instance.
Also, there is one system that works that shifts the discussion: instead of protecting the data in a layered way, having a consent in a layered way. So Genetic Alliance and Private Access provide a website called PEER where individuals can say I want to share my data, and they have kind of like tiers of who they want to share the data with, and they have some data stewardship.
Some people from the community say I’m a patient survivor and I suggest you share the data in this way because I think you should be more aggressive about data sharing. Or, I’m a person who is more concerned about my privacy and I think you should just share the data in another way.
So this allows people really the granularity to see what they want to expose and to whom they want to expose these datasets.
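[Editor's illustration, not part of the testimony.] Dr. Erlich's remark that an identifier's contribution to identification can be quantified using entropy can be sketched as follows. The records and field names are invented for illustration; the calculation is ordinary Shannon entropy, measured in bits, compared against the bits needed to single out one person in a population.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits, of the observed distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bits_to_single_out(population_size):
    """Bits of information needed to pick one person out of a population."""
    return math.log2(population_size)


def joint_entropy_bits(records, fields):
    """Entropy of several fields taken together as one combined identifier."""
    return entropy_bits([tuple(r[f] for f in fields) for r in records])


if __name__ == "__main__":
    # Invented toy records, for illustration only.
    records = [
        {"sex": "F", "birth_year": 1980, "zip3": "100"},
        {"sex": "M", "birth_year": 1980, "zip3": "100"},
        {"sex": "F", "birth_year": 1975, "zip3": "100"},
        {"sex": "M", "birth_year": 1975, "zip3": "606"},
        {"sex": "F", "birth_year": 1990, "zip3": "606"},
        {"sex": "M", "birth_year": 1990, "zip3": "606"},
        {"sex": "F", "birth_year": 1962, "zip3": "941"},
        {"sex": "M", "birth_year": 1962, "zip3": "941"},
    ]
    print("bits needed:", bits_to_single_out(len(records)))
    for field in ("sex", "birth_year", "zip3"):
        print(field, entropy_bits([r[field] for r in records]))
    print("combined:", joint_entropy_bits(records, ("sex", "birth_year", "zip3")))
```

When the combined entropy of a set of fields approaches the bits needed to single out one person, those fields together behave like an identifier even if none of them is one alone.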
DR. CURTIS: When we were talking about a shared approach that was more what I was indicating, and I do think it’s happening on a very one-off basis by companies and by organizations. I think what would be useful potentially would be investments in creating a common language and common standards as to what those tiers might look like, what’s the standard for communicating effectively with knowledge transmission to patients as to what those tiers actually mean at the end of the day.
Obviously, there would be a lot of complications implementing that, but I think getting that process started in a unified way as opposed to 1,000 different individual efforts would be very useful and money well spent.
MS. KLOSS: Thank you so much. This has been a great morning. We are due for a break and will reconvene at 11:30, and our panel participants are all here.
MS. KLOSS: Okay, we will reconvene our hearing. We are very grateful for our next panelists who are going to help us explore models for privacy preserving and use of private information. We’ll kick off with Micah Altman. Dr. Altman is Director of Research, MIT Libraries, and we ask you to start us off and introduce yourself a little further if you like.
DR. ALTMAN: Thank you for the opportunity to testify here. I am Micah Altman, Director of Research at the MIT Libraries and a non-resident fellow at Brookings. That said, these opinions are entirely my own and not anybody else’s, including those of my co-authors, though this comment is strongly informed by research with collaborators through the Privacy Tools for Sharing Research Data Project at Harvard, which is a multidisciplinary project — social scientists, statisticians, computer scientists, lawyers — and is focused on making research data more usable and confidential at the same time.
If you like something in this talk then it’s probably due to my colleagues, but if I make any errors, it’s my own fault. That said, I recommend that you read the previous things we have on record and we sent copies to the committee in which we explained these in detail, and my errors won’t be as much an issue.
One of the things we pointed out as a recommendation generally in this area — and I will note that we are not focused particularly on HIPAA though we have done a little bit in that area, but more generally on the set of emerging technologies and approaches for privacy and their use in both government regulation and research settings.
We have noted that terms such as privacy, confidentiality, identifiability and sensitivity are used in many different ways in different fields. We heard some differences from prior testimony. Depending on what field you’re coming from, these may have different meanings, so it’s important when issuing regulations or when issuing commentary to anchor those meanings to a particular discipline or just be explicit about that.
I am not going to argue for one definition over another, but just to say that when I talk about these concepts, privacy refers generally to control over the extent and circumstances of the sharing as opposed to confidentiality, which is how that information that was disclosed is further shared, and distinguish between identifiability which is what one learns about someone from looking at the data or data products, and sensitivity which is the potential for harm.
In particular, anonymization, at least in the fields that I work on, is I think best viewed as a legal concept. Anonymization and de-identification do not have strong formal analogs in the literature that we work with, and I think we heard yesterday from Dr. Barth-Jones and others that there is a lack of clarity over what constitutes a re-identification. So, it would be reasonable to define those specifically in the context of regulation rather than looking for a universal definition of these things.
I will talk instead about the challenges to identifiability and to protecting personal information — confidentiality within de-identification or re-identification. There are three large challenges that we have observed in this research.
The first is that many human behaviors leave behind distinct behavioral fingerprints. For example, geo-location data has been in the news a lot. It has been in many scientific publications, and it’s not just the level of detail that this provides but that people’s patterns of movement are highly predictable. And even at high levels of aggregation, this can be sometimes matched back to individuals.
This is not necessarily a problem even of big data but of small data in a big data world. If you have a lot of information you can observe people’s locations in other contexts. Even some very aggregated information might allow you to put two and two together. And we’re seeing this with all sorts of behaviors.
When data is released that’s generally protected by traditional methods like removal of particular identifiers it’s not zero risk, and we have heard that, also, from previous people, from Barth-Jones, from Dr. Malin and from others. The risk is small if the traditional methods are applied correctly, but it continues to grow and grow in the context of future releases. There is a composition effect, as also has been noted earlier. If you put together data from multiple sources, you learn more.
And there is essentially no free lunch. Every release of data, whether it’s making a table, making a summary from a model, doing a visualization, anything that relies on data that is useful also reveals at least a tiny bit of information about the people who contributed to that data. This is an inevitable part of the mathematics of this. The challenge is to balance this.
We have done a lot of work in an area called differential privacy which is not really — It has been characterized, for example, yesterday as part of a family of interactive methods. That is partially true. It’s often used as a basis for interactive query servers that protect the data behind them and for ways of blurring the data so that the results don’t give too much away. But it’s really not an interactive method but a theory about how you measure learning that has become very well accepted in the cryptography community.
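[Editor's illustration, not part of the testimony.] One standard way of "blurring the data so that the results don't give too much away" is the Laplace mechanism for a counting query, sketched below. This is a textbook construction, not necessarily what Dr. Altman's project uses; the function names and the default sensitivity of 1 (adding or removing one person changes a count by at most 1) are assumptions of this sketch.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered at zero with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng=None, sensitivity=1.0):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Smaller epsilon means more noise and less learned about any one person.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(2016)  # seeded only so the demo is repeatable
    for eps in (0.1, 1.0, 10.0):
        print(eps, noisy_count(250, eps, rng))
```

Epsilon is the measure of learning: under basic sequential composition, answering several queries on the same data spends the sum of their epsilons, which is one formal way to account for the accumulation of small disclosures across releases.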
The problem is protecting against learning. We have gone from learning Where is Waldo to can we use data to find someone in a database, to finding people in a crowd, to learning something about them. And it’s in this big data world that we need to be concerned about the accumulation of all these little bits of learning into what might constitute an attribute, a known attribute about the person.
Similarly, when we talk about sensitivity, measures of sensitivity essentially go back to threat models, and we have heard about the importance of this.
In the interest of time I’ll skip ahead except to talk about utility, which is the analytical value of the data. There is an inevitable tradeoff between utility and the information leaked. There is no free lunch. But traditional methods such as removal of fixed identifiers often do not fall on the frontier of what is the optimum you can get in terms of inference or privacy, so some of these emerging methods allow for better tradeoffs.
We offered in previous research a number of principles for approaching this. One is to calibrate the privacy and security controls to the intended uses and risks; to consider not just re-identification risk but the entire spectrum of inference risks because, again, they compose in a big data world; to use not a single set of controls like removal of fields but a combination of controls, and we have heard some of these before. De-identification, meaning removal of identifiers, should not be considered a silver bullet; it may make sense in the context of other things like an assessment of the risk to individuals from that particular data, meaningful consent, control of the data, data usage agreements, but there is no single control that makes this go away.
Particularly, we recommended looking at the entire lifecycle of data management from collection through transformation, retention, access, and then after access because the data exists and it is out there. Even if the data itself has been modified to reduce risk, it still poses some risk. Controls can be applied; methods can be applied in each of these stages to minimize the risks to individuals.
This research has developed catalogs of controls from literature reviews of legal literature, statistics and cryptography. You can think of the controls as coming from different fields — procedural, economic, educational, legal, technical — at different phases. You can also think of them as how are we trying to limit how people compute on the data, how are we trying to limit how they make inferences from those computations, and how do we limit the uses to which they put those inferences. So there are three classes of activity we might target.
Along with Malin and Rubinstein and Herzog, this essentially means privacy needs to be designed into the system. There’s a broader set of controls. Most government regulations do not take advantage of this spectrum of controls. And there is no set of controls that provide for zero, that are a silver bullet in and of themselves.
But this is consistent with the principle frameworks that we’re familiar with in the past. If you look at OECD principles or privacy by design or the fair information practices, they all emphasize looking at the lifecycle of data, not necessarily in the same detail or the same stages we have used, and all emphasize the need to plan from the collection process on up, both for sharing and for analysis.
What we believe would be an advance over this guidance is developing a catalog of privacy controls, like there are catalogs of security controls that are accepted at various stages in the lifecycle, and we have developed a prototype for that, and to have expert clearinghouses that discuss what are appropriate controls for different levels of risks and give guidance on implementing those. These will change over time, so embodying one set of controls into the literature and into the regulation probably will not stand the test of time, but having some framework, as FISMA is to security, that has an evolving set of controls that are mapped to larger categories of threats is flexible and can grow over time.
Finally, to put this all together, we have an example of how one might calibrate controls to risks. In general, as we have heard before, implementing a single set of privacy and security controls is probably not going to address all the intended use of the information, so we recommend that a tiered access model is anticipated and that data may be available under different sets of controls — richer access to data may be available under different sets of controls.
For example, if the data has direct and indirect identifiers and it’s risky in that it causes a significant and lasting effect on people’s lives — and these are rough categories but they’re amenable not to perfect quantification but to some quantification types that IRBs do now — then this can be offered in sensitive, secure data enclaves where there is a vetting of use and there’s auditing of the analysis; the data is used within a particular system. Models of this include the Census RDC, it includes secure data enclave at NORC.
If the data is more protected by removal of identifiers by technical means, there’s less learning risk or less harm risk downstream. It might be released with a more formal application — still not a public release but it would be released with some vetting and oversight to researchers or organizations that are known and monitorable.
In the middle, as we either apply traditional statistical disclosure limitation techniques by experts or the damage or expected harm is smaller, you might release these with simple notice and consent in a data use agreement, in the click-through model. And if you release it entirely publicly with no further click-throughs, no further monitoring, it should either be established that the future harm is essentially negligible or some forms of rigorous formal models for measuring learning should be used, like differential privacy.
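[Editor's illustration, not part of the testimony.] The calibration of controls to risk described above can be sketched as a simple lookup. The category names, the three harm levels, and the handling of combinations the testimony does not address (for example, identified data with only minor expected harm) are assumptions of this sketch, chosen conservatively.

```python
HARM_LEVELS = ("negligible", "minor", "significant")


def access_tier(has_identifiers, harm, formal_guarantee=False):
    """Pick an access tier from identifiability and expected harm.

    has_identifiers:  direct or indirect identifiers remain in the data.
    harm:             expected harm to individuals, one of HARM_LEVELS.
    formal_guarantee: a rigorous formal model for measuring learning
                      (such as differential privacy) protects the release.
    """
    if harm not in HARM_LEVELS:
        raise ValueError(f"harm must be one of {HARM_LEVELS}")
    if has_identifiers:
        # Identified data with significant, lasting harm: enclave with
        # vetted use and audited analysis. Lesser harm: assumed here to
        # still require a formal application.
        return "secure_enclave" if harm == "significant" else "formal_application"
    if formal_guarantee or harm == "negligible":
        return "public_release"
    if harm == "significant":
        # Identifiers removed but harm still serious: vetting and oversight.
        return "formal_application"
    # Identifiers removed and smaller expected harm: notice, consent and a
    # click-through data use agreement.
    return "click_through_agreement"


if __name__ == "__main__":
    for args in [(True, "significant"), (False, "significant"),
                 (False, "minor"), (False, "negligible")]:
        print(args, "->", access_tier(*args))
```

A table like this is only a starting point; as the testimony notes, the categories are rough and the appropriate controls would need to evolve over time.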
This is, again, based on four papers that we published in various places, and these contain hundreds of references to these technologies and different interventions. These have been circulated. Thank you again for the opportunity to speak on this.
MS. KLOSS: Thank you very much. We’ll look forward to probing a little more deeply into these excellent slides in our Q&A session.
Next we have Sheila Colclasure, Dr. Colclasure, Privacy Officer at Acxiom, and we welcome you.
MS. COLCLASURE: Thank you so much for having me and thank you for the promotion. It’s a secret childhood dream of mine so that was especially fun for me today. I am Sheila Colclasure and I’ll give just a little background on me and then background on the company. Then I’m going to talk to you about the commercial perspective because Acxiom is an infrastructure player.
I’m a native Arkansan. After grad school I moved to D.C. and I lived and worked here for about 10 years in the U.S. Senate and then off the Hill at the AICPA where I managed congressional and political affairs. Then I moved back and went to work for Acxiom, so I am really thrilled to be here today. Thank you so much for having me.
This is very important work because healthcare, the health vertical, is white hot, and I would argue it’s the thing that is the most central and intimate to each of our lives — how we feel every day, our wellbeing, and those tools, apps, providers that contribute to that, and then how we pay for it. So it’s a very important piece of work here, something I am extremely passionate about.
I have been at Axiom for 20 years, and for 20 years I have run their operations on privacy and data governance and data compliance. Recently, my boss semi-retired and I have stepped into the global role, so I’ve taken on a global perspective and I am the global public policy and privacy officer, and all that that means.
To tell you a little bit about Axiom, it’s an old and venerable technology company. We have been around since 1969, and we’re at about $1 billion in revenue. Our client set is the Fortune 100 mostly. Virtually every dollar that we create we create by doing data work, and it’s either data products that we curate, and it might be an identity or fraud detection data product or a marketing data product.
But the other roughly 80 percent of our revenue is work on our clients’ data. That is, cleaning the data, transforming the data, hygiening the data, enabling the data, building an identity solution or recognition solution, doing a marketing solution. So everything we do is data intensive for 45-plus years.
Our privacy program started because we are so data intensive. Privacy got hot for us in 1991. We have the oldest privacy program on record, and my predecessor was the first chief privacy officer anywhere, so we had the first program, and we have a very matured and evolved program.
What we do is we approach it with an ethical construct, and this is very important because as we have been in business for 45-plus years, we’ve operated across all the verticals and we have seen every piece of data that’s available, how the data is used, how clients want to use the data, what data is not available, what business problems data solves, what it doesn’t solve. And we’re doing a very rapid pivot to the truly big data. We have always been big data — and I want to talk about that in just a minute, but this is where I want to start, essentially.
Data is really dirty stuff, and most of the clients we deal with — we have come from this traditional world of personally identifiable data. In the health vertical, we have those clients as well. In all these other verticals — financial services in particular — they have invested so heavily and so aggressively in data management and data recognition because they have interoperability problems across their data silos.
One of our clients, as an example, without naming names, when we began their data work they had roughly 1,001 different data silos, and they had different channels and different intake methodologies. It’s very analogous to the health vertical where you have a consumer or patient that comes into a healthcare system for care across an entire ecosystem. And at each of those touch points, there is some flavor of intake where there is human error or technical error, or the data gets siloed and stuck, and the value of the associative data, associated with the identity of the individual, becomes latent. You can’t extract it.
So, in all these data silos, the number one problem that the other verticals have come after is solving the dirty data problem. How do you accurately recognize a person, integrate that and then you can bring forward all the attendant data and extract the value. And this is the underlying problem, in my view, in the health vertical no one has really grappled with yet, that you have to ingest the different strands of data, the different feeds.
In my example — and this is one of the largest financial service companies in the world that we work for — it’s a continuing problem because data degrades so rapidly and a lot of it is intaken via human beings. So it’s consistent; you have to constantly correct, fix, parse, transform, get the data accurate and then have a technological means to accurately integrate the data.
Do all of that data transformation, achieving accuracy, then you decide how to use the data. So you fix the data; you achieve accurate recognition of the person the data relates to, and then you use the data. At that point, you make a decision, whatever your use is, maybe it needs to be de-identified and maybe it doesn’t. Maybe it needs to be de-identified in such a way that there’s an escrow key and an escrow function so that down the road, when you’re like, oh, my goodness we’ve discovered or we think there’s this huge benefit that’s life-changing, you have a means, a secure, permissible, reliable means to connect the data, but it travels and can be transferred in a way that’s controlled.
This is the way we have come at the data and solved the interoperability problem, as I say, well ahead of where the healthcare vertical is now. We have moved very aggressively in financial services, which is heavily regulated, in telco, in gaming, in technology itself, and in companies that deal with children’s data, and they are well ahead and they’re coming at it in these measured steps.
I want to talk for just a moment about the de-identification problem after you achieve the accuracy of the data so you can have an accurate representation or recognition integration of all the different records and you want to de-identify. There are many different methods on the market.
The one that we have innovated, patent pending, all of our technologists had to withstand the scrutiny — our most important industry vertical is financial services, so our method has withstood the scrutiny of all the Fiserv clients. They have come in and kicked our tires, and I’ll tell you this and talk about our method in just a moment. But this is what we’ve learned.
It really takes two things. It takes a technological treatment of the data to achieve the de-identification, but that is not enough. Those are technical controls or technical mitigations that prevent. But you also have to have administrative controls; that is, all the policy construct that says what you will and won’t do, how you will and won’t use the data, all the contractual promises.
And it takes the two things plus an accountability system. This is the way that we apply our data work across all the industries where we work, including healthcare, where we solve their dirty data problem and we achieve accurate recognition or integration of all their different silos of data. When they need to share with another health partner, we do the same piece of work and they have, with precision, extreme precision.
We tune it for the healthcare vertical. Health is more important than anything else. This affects lives and outcome and wellbeing. It’s not a matter of marketing — did I get the right piece of marketing material or the right offer — this affects lives. So, in the health vertical, it requires even more precision than we apply in other places so we have specially tuned rules for healthcare.
In the process that Axiom uses, what we know — and this is a theme that I keep hearing around the room — this idea of trust. Business moves at the speed of trust, so you have to have an accountability program that ensures not only have you applied the technical means at the engineering layer to whatever you’re doing, but you have also applied the administrative controls and held yourself accountable.
I’m just going to tell you how Axiom does it because we’ve been doing it for a very long time, and we have to withstand the scrutiny of the largest brands in the world, and they trust us with their data to keep it secure because security is table stakes, and then to enable the uses of it, the accuracy of it, the de-identifiability of it in a secured, reliable, accountable means.
Our program is called the Ethical Contextual Interrogation, and it’s what we’ve heard in layman’s language as the privacy impact assessment. Every client implementation that we do, every recognition event that we do goes through this process where we have stakeholders. We have four stakeholders at the table. We use a privacy person, a legal person, a security specialist and an engineering person. So we have four stakeholders that examine the design, the rules, the outcome. This is the outcome that the data enablement was intended to achieve; this would be the use construct for healthcare, if we’re going to share with another entity.
These four stakeholders come together on these contextual ethical interrogations and we characterize the project. We identify every data stream. We measure for any legalities. Is it regulated data? Is it co-regulated data? What are the algorithms that are going to be acting on the data? We measure for outcomes. Is it legal? Is it just, and is it fair, just the right use of data? Is it fair not just to the healthcare provider or the pharmacy or the insurance payer — is it not just fair to all of those brands but is it also fair to the person? Is this a fair use of data? This contextual ethical interrogation happens at the engineering layer and we identify technical controls, administrative controls at the design, and then we enable the solution. This is our accountability program to ensure that we’re getting it right every time.
I want to pivot for just a moment and quote the very famous William James, who is credited as the Father of American Psychology and who taught the first psych course at Harvard. William James said that people tend to become who they think they are or what they think about themselves. We see society becoming what it thinks about itself.
How many people have a smart phone? How many people have a few apps on their phone? Any of those a health app, or any of us have a smart device, a bioband or sensor-embedded shoes or sensor-embedded shirt? Anybody bought a brand new car that has a retinal fatigue sensor on the rear view mirror? All of these things are coming, and we are rapidly adopting them.
So the pivot is this. Axiom comes from this very traditional world where we overcome all the data challenges and we deliver value and we hold ourselves accountable, and we measure ourselves with the ethical standard. You have to have trust. Our brands have to be able to trust us. We’re operating in the health space.
Here’s the extreme pivot. We really are moving into a world of big data where most of the data that’s collected about each one of us is observational in nature. We’re going to a pivot where 90 percent of the data that’s created, captured, used and shared is observational. And with rapid adoption, we are all very interested in our own wellbeing and we’re harnessing the power of technology and big data to learn more about our bodies and to take better care of our bodies and to improve our own wellbeing.
We are also moving towards convergence where, like Dr. George Savage — I’m sure you guys know the inventor of Proteus, the sensor-embedded med, amazing technology that will help us keep our parents — if you have an aging parent like I do — in the home and you can manage them from afar. This is amazing technology, and we are all rapidly adopting it. So there is opportunity for great benefit. There’s also great opportunity for harm.
So it’s two things. It’s about the governance and accountability; it’s about enabling the benefit and identifying and protecting against the harm. All of that has to be done against a set of social values, what we collectively believe as a society is okay. We may be very different than, say, a society in Germany. The American sensibility may be different.
So, collectively, the way that we enable all of these amazing new benefits, and when we achieve this thing, convergence — my Fitbit talks to my app via Bluetooth and then I decide to share that with my doctor. Or my sensor-embedded med talks to my Bluetooth-enabled wearable and that’s great, and I’m going to share that with my doctor or not, or manage it myself.
When you converge — and I think we are moving towards rapid convergence — it brings all of these questions into play. As a society, how do we enable the benefits? How do we detect and prevent against the harms? What do we identify as harms? How do we implement systems of accountability?
While we’re talking today about identification and de-identification, Axiom uses a quad hash with double-salts in the runs. It’s a very sophisticated technique. We really also need to focus on how we’re going to treat that when your name and your address and email and phone number become irrelevant and we’re working in a device-identifiable or an entity-identifiable world.
I’ll pause for just a moment with my commercial view and hand it back over.
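Illustration, not part of the spoken testimony: the remarks above mention hashing with salts as a de-identification technique but give no details. The sketch below shows only the generic idea of turning a cleaned identifier into a stable token with a keyed hash applied in two passes under two secret salts. It is an assumption-laden example and not a description of the speaker's patent-pending method. The names `normalize` and `pseudonymize` and the salt values are hypothetical.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so the same person yields the same token."""
    return " ".join(value.strip().lower().split())


def pseudonymize(identifier: str, salt_a: bytes, salt_b: bytes) -> str:
    """Return a hex token from two keyed SHA-256 passes (HMAC) over the identifier.

    Assumption: salt_a and salt_b are secrets held under separate control, so
    neither holder alone can recompute tokens from guessed identifiers.
    """
    inner = hmac.new(salt_a, normalize(identifier).encode("utf-8"), hashlib.sha256).digest()
    return hmac.new(salt_b, inner, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Hypothetical salts; real ones would be long random secrets, never hard-coded.
    token = pseudonymize("  Jane.Doe@Example.com ", b"salt-one", b"salt-two")
    print(token)
```

Normalizing before hashing reflects the point made earlier in the testimony that data has to be cleaned before records can be matched. A keyed hash on its own is a technical control only; the testimony is explicit that administrative controls and accountability are also required.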
MS. KLOSS: Thank you. We will move to Kim Gray. Kim is the Global Chief Privacy Officer for IMS Health.
MS. GRAY: Hi, everyone. I’m Kim Gray and I am a year older today. I am the Global Chief Privacy Officer for IMS Health. IMS Health is one of the leading processors and users and maintainers of healthcare information worldwide. We operate in more than 100 countries and we have been doing this kind of thing for some 60-plus years at this point in time, primarily using de-identified data.
We were doing data de-identification before HIPAA, obviously, but I’m primarily here today to tell you that I think HIPAA is actually a really good thing, and I think HIPAA does a wonderful job of giving us guidance around de-identification. That may not be popular, and I have heard a lot since I’ve been here about changes in the offing.
But having worked with this kind of data for a very long time now — I’ve been at IMS for seven years, and prior to my tenure at IMS I was a chief privacy officer for a health insurance company. I am one of the old school chief privacy officers, and I know HIPAA inside, outside, upside-down and backwards in my sleep, and I actually like it.
Never thought I would say that when we first had to comply with it, and it was a pain for everyone. But HIPAA did a very good job of anticipating things. It is more nimble than we give it credit for being.
In my world of de-identification, it allows for beneficial uses of data while protecting patient privacy, and the guidance that is there is not too black and not too white and not too gray. It allows for some usage that — like, I said, it’s nimble. It allows for changes in technology. It anticipates changes in technology, in fact, and in a big data world — and we are one of those; our company handles many petabytes of data, so I’m familiar with using large masses of data. But HIPAA actually allows us that flexibility that we need. It’s the gold standard, if you will.
There are other standards out there. If you look around the world you’ll find other standards on anonymization, which is the term of art used more frequently outside the U.S., as well as de-identification. And if you look at those, in my world I keep coming back to HIPAA. It gives you better guidance than the other standards do. And these standards apply from the U.K. to Australia to Canada to you name it, and then within industry sectors, too, not just by geography or governments.
The balancing of these principles I think is what’s key because we can’t lose sight — While patient privacy protection is, obviously, very, very important, so is the use of the data. If we get to a point where we are over privacy protecting and ignoring those beneficial uses of data, we’re hindering the free flow of information that allows for innovation, which allows for better public health and improved access to care, less medical errors, all those wonderful things that can come from data on a community level.
But if you think about it, it’s not just on the community level because, by the way, public health benefits help us as individuals, too. At the end of the day, we’re all patients or we’re married to them or we have parents who are or whatever. Even if you’re looking at that balance as being between a community standard and privacy for an individual, that community standard translates into individuals as well.
I have heard, since I’ve been here today — and I apologize. I didn’t get to hear the speakers yesterday because I only wanted to spend my birthday here and not two days —
I have heard a lot about zero risk, and I want to echo that, too.
When HIPAA was being drafted, that was considered at the time as well, should we have a standard with zero risk. That was discarded because if you have zero risk, it improves that patient privacy protection just this much, just a minuscule amount, but then hinders that use of the data which is so important. I find it interesting that in discussions on de-identification — and I am party to a lot of them, fortunately or unfortunately — we tend to hear a lot of naysayers talking about de-identification not being such a great thing. There’s a high re-identification risk.
And what that really translates into is an assumption that there should be that zero risk. If you peel away the layers of the onion and you talk to the people, you find out that if there’s any chink in that armor at all and anyone can ever be re-identified, then we’re going to throw the whole standard out. To me, that’s ludicrous.
There’s a comparison that some have made to information security. We never seek to have zero risk in information security. We’re always looking for what is the risk and how do we manage it. That’s an important principle and I think we need to pay attention to that, but HIPAA anticipated that as well.
I’m here today to say, in my opinion, while a lot has changed in the healthcare landscape since HIPAA came about, it is still a very valid way of dealing with things. When you are correctly, properly, appropriately perhaps de-identifying data it’s not just a simple exercise. It does assume that you have to think about a few different things. You’re removing or disguising or doing something else to the identifiers, but you are also putting those safeguards in place.
So this does not become a statistical exercise only. When we say expert determination we mean that. It’s a more holistic approach, and we’re looking at the administrative, technical and physical controls around that data as well as the removal of the identifiers.
Then, in the back of our minds we have to be thinking about what’s going to happen with that data. How are we going to use that data, what’s the utility? You can’t over de-identify to a point where the data is useless, but you need to have it be robust so that the patient privacy is protected.
I think there is a lot of confusion on when something is de-identified but I think that might be — if I have any criticism of the HIPAA standard, I think that the actual education about it might be where we could do some improvements.
To that end, we had lots of conversations in privacy leadership settings where the HIPAA standard would come up and people would say, well, how do you actually go about doing the de-identification? Who are these experts? And I heard someone say earlier there were only five of them, and yes, that was always a criticism, too. We could name the five guys — and we use that term very generically — who were the experts at doing the de-identification, as they called it, certifications. Even though HIPAA doesn’t use the word “certification” that’s what was being bandied about.
So a group of us got together and said let’s do something about this. Let’s enhance HIPAA, not change it but enhance it, and to enhance it, let’s see if an organization that’s already in existence called HITRUST might be willing to partner with us to make better knowledge about the de-identification methodologies, to maybe have more of those experts, and to do other kinds of things that would make the HIPAA standard, still being the gold standard, a little bit more understandable to people.
So, with that, we created the HITRUST de-identification standard. The working group that built it was in existence for a little more than two years, probably. Those of us who worked on it came from different organizations, and in fact sometimes competing organizations, but we wanted to see a benefit to everyone. We decided that what we wanted to address were the categories of health information, the reason being that you have fully identified information at one end of the spectrum and fully anonymous information at the other end, but there are so many types of information in between, and how de-identified you need to make something is contextual. Right?
You look at the use cases and you look at the intended recipient, and on one end you maybe want to have it a little tighter than the other. If you’re going to publish your data to the Internet, it ought to be pretty de-identified. If you’re going to be sharing it with a trusted recipient and putting lots of safeguards in place, maybe a little less so as to allow more data utility to allow your researchers to actually do the research they need to do.
So the HITRUST framework also evaluates the de-identification methodologies. It took a look at expert qualifications, what should they be, and they’re pretty broad. We hoped that by doing that and putting a little more meat on the bones of that we could actually get more people who would want to be experts and to help out.
We took a look at re-identification risks, how do you measure that. And probably most importantly, we mapped this de-identification framework to the existing HITRUST common security framework, or CSF. HITRUST came about some eight years ago with the idea that if there is going to be sharing of health information through electronic health records, health information exchanges, if you will, everybody needs to be upholding the same security standards and safeguarding that data. You can’t have one organization that’s part of that sharing be the problem.
So the common security framework level-sets and puts controls in place that should be used by organizations sharing data. We have mapped the de-identification framework to those controls so you have more of an objective way of looking at de-identification, a little less subjectivity to it. Right now, there’s a training program underway to the HITRUST framework, so hopefully we really will get more of those folks interested in becoming experts in this and have more options for those who can look at a dataset and say whether it’s de-identified and has a very low risk of re-identification.
In addition, there are assessments for organizations that want to make the framework part of their culture and be certified. And then, last but not least, regulatory support. We see the framework as being something that not only helps business and helps the patient at the end of the day, but helps the regulators have something that they can use as guidance, if you will.
I will quickly go through our recommendations for a de-identification program. This is coming from the HITRUST framework. These are things I’m sure others have talked about before me so I won’t spend a lot of time, but it’s important that you note that this is holistic.
There’s governance, who is in charge of de-identification. Who is documenting it, who are the recipients, who is the custodian? Are you going to have someone external come in and take a look at it, and then an agreement that no one is going to re-identify?
A second set of recommendations for a DID program was setting risk thresholds, deciding what the risk is that you need for your use case, for your particular dataset, which will be fluid; measure the actual risks; identify what your direct identifiers and your indirect identifiers are; who are the possible adversaries who might try to re-identify it; how you are going to transform the data; the template for doing this so it’s a repeatable process; and, probably very important, the mitigating controls so you stop any risk, and then the data utility.
So that’s my plug for HITRUST. I would be happy to entertain any questions on HITRUST or IMS as far as how we do de-identification, and I thank you very much for the opportunity today.
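Illustration, not part of the spoken testimony: the recommendations above include setting a risk threshold, identifying indirect identifiers, and measuring the actual risk. One common way to measure it, shown here as a minimal sketch and not as the HITRUST framework's own method, is to group records by their indirect identifiers and treat each record's risk as one divided by the size of its group. The function name `reidentification_risk`, the field names, and the 0.2 threshold are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Summarize risk from group sizes over the chosen indirect identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)  # how many records share each combination
    per_record = [1.0 / sizes[k] for k in keys]
    return {
        "max_risk": max(per_record),
        "avg_risk": sum(per_record) / len(per_record),
        "smallest_group": min(sizes.values()),
    }


if __name__ == "__main__":
    # Hypothetical toy dataset: ZIP prefix and birth year as indirect identifiers.
    data = [
        {"zip3": "722", "birth_year": 1970},
        {"zip3": "722", "birth_year": 1970},
        {"zip3": "722", "birth_year": 1970},
        {"zip3": "200", "birth_year": 1985},
    ]
    summary = reidentification_risk(data, ["zip3", "birth_year"])
    threshold = 0.2  # assumed threshold; in practice it depends on the use case
    print(summary, "meets threshold:", summary["max_risk"] <= threshold)
```

A result above the chosen threshold would point to further transformation of the data, such as coarser ZIP codes or age bands, or to stronger mitigating controls on the recipient, in line with the contextual approach described in the testimony.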
MS. KLOSS: Thank you very much. We will go to questions. Sallie?
MS. MILAM: Great discussion and lots of good information and thoughts. Thank you all for being here today and sharing with us.
Micah, I think you stated that we need a catalog of privacy controls like security has, and I have heard others speak to needing external review. We have talked about assessments.
I’m wondering if you all have reviewed any of the new privacy standards out there, the new ones in NIST, and if you could speak to what you think might — Are they evolved enough? Where are we with sort of an internationally or nationally recognized set of privacy controls, and what do we need to do to get there?
DR. ALTMAN: I will speak as myself rather than the collaboration. In some of the work we reference the FISMA set of controls. In the last revision of FISMA they added a number of privacy controls. It’s my opinion that that is progress forward, that those are a reasonable set of controls to consider.
The framework for the FISMA security standard tends to focus, as a lot of information security does, on what happens within the system. So, when we talk about controls to limit inferences on the data later, the background is not as strong there. I understand there is work in progress in this de-identification group. There’s also work in the NIST big data working group where they’re looking at privacy.
My observation is that emerging methods like secure multi-party computation, more formal measures and controls on learning that go beyond removal of a particular ID or measuring of a record linkage risk to a more general learning risk and protect against the mosaic effect, those areas the current catalogs don’t have a lot, and we need development in those areas.
The set of controls over downstream uses — it’s very difficult to communicate to users what the possible uses are or even to communicate across organizations. So we need better, more formal taxonomies about what are uses that are interoperable so that people can easily see and understand what their data is being used for. And there is some ability to do, say, automated auditing on this without parsing two months of click-through agreements every year, as some of the work at CMU. If you read all the license agreements right now, you’re going to spend months of your life reading them.
MS. GRAY: I have some familiarity with NIST Appendix J, and the one thing that struck me when — and it has probably been two years since — when it first came out I guess is when I spent a lot of time looking at that. To me, it was very much like HIPAA with a few additional controls added for those entities that were working with governmental units.
We actually have a privacy standard on the HITRUST framework, too. Not just the DID, but privacy has now been embedded into the common security framework, too, and I was on that working group as well. We looked at Appendix J and chose not to put all of those standards in as required standards because our feeling was that the additional ones above and beyond HIPAA may well keep organizations away, that the additional governmental requirements in Appendix J above and beyond HIPAA were enough that a lot of organizations were not going to want to do them if they didn’t have to.
Unfortunately, we live in a society where you do what you need to do for the most part. There are some organizations that will go above and beyond. Axiom is one; IMS is another. We tend to see each other in a lot of these because we do that. But, for the most part, a lot of organizations wish to stay at just what they need to do, and if HIPAA was the only requirement, going beyond that was going to require financial resources, human resources, other kinds of capital expenditures that really people were not going to do.
MS. COLCLASURE: For Axiom, we actually do HITRUST in our health vertical when we serve those clients, and it’s a very important standard. We also do NIST. We have analyzed the NIST standards and we do something called a cross-map where our layered security approach and our data governance approach are very evolved. We took both the NIST and the HITRUST standards and we did a cross-mapping of all the other layers of controls and audits and accountability and processes and governance, and we essentially went through and checked to make sure that we were solving for each one of those and we could very carefully map it.
But I thought the NIST was very well articulated, a very thoughtful piece of work, and HITRUST as well. Not only is it really well done but it’s a requirement now in the health vertical with our health clients.
MS. MILAM: Are you willing to share that cross-walk with us?
MS. COLCLASURE: I will need to talk to my chief security officer. That may be confidential because that’s typically something we produce for our clients.
MS. MILAM: Not the contents; just the map.
MS. COLCLASURE: Maybe. Let me talk to my CSO and we’ll see what we can get you. Would you be interested in both the HITRUST and the NIST? Okay.
MS. GRAY: And HITRUST does anticipate that. I have to get my cheat sheet to see, but it also harmonizes HIPAA, HITECH, PCI, COBIT, NIST and FTC, so all of those cross-walks were done by HITRUST for the standard common security framework.
MS. MILAM: Is that in the document itself, what you map to?
MS. GRAY: No, but I would be happy to share that with you.
MS. MILAM: Thank you. That would be helpful in so many scenarios.
MS. GRAY: Anybody who wants any HITRUST things, just talk to me. It’s kind of comprehensive and a little hard to narrow what you would need. But looking at the DID part of that would probably be helpful and I would be more than happy to share that.
And a lot of the other HITRUST documents are available on their website. There’s licensing for some. Some of it, there’s no charge; others, there is.
DR. PHILLIPS: Thank you all very much. I have to say I appreciate the level of sophistication across all three of you. Whether it’s HITRUST or extrapolating over from your work in the financial vertical to the health vertical or, Micah, even you talking about having these almost titratable privacy controls, I think this is amazingly helpful.
Where I get concerned is reflecting back on the discussion we had previously where you have front line data collectors with data controls that don’t have this level of sophistication and probably never will. How do we diffuse what you’re doing out to front line data collectors, or how do we create relationships for them with groups like you so that there is a safe place for them to deposit their data for its utility elsewhere? Is that the solution? Or which of those is the solution, or do you see a different one?
DR. ALTMAN: I think it’s unrealistic to ask people to hire us, a flock of PhDs, to look at all of these things. Having progress on catalogs of controls where we bring this further into NIST or other accepted documentation, but especially bringing in the newer technologies such as secure multiparty computation and formal measures of privacy, which are not well covered in this area yet, developing standardized licensing agreements and explanations of those — it’s not just the implementers who are confused; it’s the consumers. I think there’s just too much information, and it’s very difficult to understand how your information will be used and what any particular third party is going to do with it or not.
It’s not because people aren’t trying to do due diligence, but it’s because there are not standardized languages and descriptions, the way you can go to Creative Commons and get icons for how your information will be shared in an intellectual property sense.
Building in expert panels, or expert boards, who help develop guidance and supplement these catalogs, because these need to be continuing things, and developing some incentives and systems to be interoperable, with the idea that the data is going to be shared along with information about the uses and restrictions. The data is not going to be transformed into something that’s probably entirely safe, or at least not for most purposes, and so systems need to be built to design privacy in and to be able to interoperate with other systems. I think that is the biggest challenge for implementers, crossing systems.
DR. PHILLIPS: I appreciate that. Having a catalog of these is important, but it also means you have to have either sophistication or clarity on the other side of which data do you apply which catalog elements to. I’m not always sure these front line providers — I know front line providers don’t have it.
So how do we actually even implement a catalog like that at a provider level? That’s where I’m really struggling.
MS. COLCLASURE: Could I just jump in and ask, when you say provider you mean the healthcare provider.
DR. PHILLIPS: I do.
MS. COLCLASURE: You’re absolutely right in my view that the intake personnel that are collecting data from the patient are never going to have adequate training. They’re never going to intake data correctly and appropriately every time, and it’s unlikely they are going to absorb and apply all the confidentiality rules. We have seen plenty of that where they don’t have sufficient understanding. Either intended or unintended, the data gets misused or misappropriated. And that is an issue, certainly.
In the world of managing the data, I think it’s exactly what Micah said. That is, you have got to have the design. The intake tools themselves have to have thoughtful design that does some field-by-field — with the clients that we have, this is what we do. The field-by-field, there’s something that checks the accuracy of the data first, maybe hits a referential database and does some real time correction. That’s one.
Maybe the technique to keep the data from being copied or misappropriated, the tool itself has that design consideration in it where you can’t make copies, you can’t screen shot it. And maybe the field, once keyed in and real-time validated, maybe they are obscure or partially obscured from the intake person. So you overcome in your design, when you’re designing the tool and the intake method. You apply consideration at the design stage and you design it the best you can. Then you apply training. That’s what takes care of that rule.
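The intake design described above (check each field against a reference source, then obscure it from the person keying it in) can be sketched roughly as follows. This is only an illustration under stated assumptions: the reference set `KNOWN_ZIPS` and the function names are hypothetical and are not drawn from the testimony or from any vendor's tool.

```python
import re
from typing import Optional

# Hypothetical reference data; a real tool would query a maintained database.
KNOWN_ZIPS = {"20201", "25301", "29634"}

def validate_zip(raw: str) -> Optional[str]:
    """Return a cleaned 5-digit ZIP if it is in the reference set, else None."""
    digits = re.sub(r"\D", "", raw)[:5]  # drop punctuation, keep first 5 digits
    return digits if digits in KNOWN_ZIPS else None

def mask(value: str, visible: int = 2) -> str:
    """Obscure all but the last `visible` characters for on-screen display."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

def intake_field(raw: str) -> dict:
    """Validate a keyed-in value, then return what to store and what to show."""
    cleaned = validate_zip(raw)
    if cleaned is None:
        return {"accepted": False, "display": None}
    return {"accepted": True, "stored": cleaned, "display": mask(cleaned)}
```

The point of the sketch is that both the correction and the obscuring happen inside the tool, so they do not depend on the training of the person at intake.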
The issue that we bumped into the most — One of our clients that’s outside of Fiserve the payers, the health insurance companies, and we have the largest ones. We do their data work. They’re dealing with data that has been intaken many different channels, many different means, many different ways. That is beyond our reach to affect all of those intake tools, but when we get the data with all of its endemic problems, then we do correction.
So I would suggest you need to parse out the issue but still need to do what Micah says and what Kim says. At the design stage, contemplate what — use case it. What are the problems? Let’s design the tool the best we can to overcome it. Let’s write policy. But it’s really about the technical capability, especially as we accelerate into the world of big data where we don’t even know yet what kinds of discoveries from the good uses of data we are going to make or want or need.
DR. ALTMAN: I would agree with that, especially around the use cases. I’ll note that in the testimony and the slides referred to are comments on the notice of proposed rulemaking on the Common Rule, and there, some of the elements of the guidance we suggest are scenarios of use cases for what are informational harms, how do you recognize different levels of sensitivity, different elements of data that you’re collecting that might be sensitive. So that is one element of guidance.
I’ll add one more thing which is that, in the use case realm, this is the flip side of the no free lunch. Every time you have something useful you lose a little privacy. Every time you put in a protection that reduces the data in some way, every time you take some information out of the data you also lose on the inference side, whether it’s accuracy or bias, and generally both. So you also need use cases for how is the data going to be analyzed later, and what decisions are you going to be making on it and what sort of protections are compatible with those downstream uses.
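The trade-off described here can be made concrete with a small sketch. It assumes a single numeric field (age) and one simple protection (grouping ages into bands). The function names and the choice of metrics are assumptions made for illustration, not a method presented in the testimony.

```python
from collections import Counter

def generalize(age: int, width: int) -> int:
    """Map an age to the lower edge of its band, e.g. 28 -> 20 for width 10."""
    return (age // width) * width

def smallest_group(ages, width: int) -> int:
    """Size of the smallest band; larger means each person is better hidden."""
    return min(Counter(generalize(a, width) for a in ages).values())

def mean_abs_error(ages, width: int) -> float:
    """Average error if each person is represented by the midpoint of the band."""
    total = 0.0
    for a in ages:
        midpoint = generalize(a, width) + (width - 1) / 2
        total += abs(a - midpoint)
    return total / len(ages)
```

With wider bands the smallest group grows, which helps protection, while the error grows too, which hurts any later analysis that relies on exact ages.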
MS. COLCLASURE: One of the big ideas in the think tank groups that we participate in around the globe is the parsing of the two constructs of privacy — this idea of I’m giving you notice and you’re getting some choices to participate or not. That’s privacy. And the other idea is data protection or the full and complete fair processing of data which includes all of the data governance aspects.
If you consider those two issues and you parse it out that way, the next big issue is — and you’ll hear it called — the bolted-on versus the baked-in. You use case everything, and instead of trying to put a control after something is designed and stood up, be it an intake method or a use and enablement of the data, rather than then trying to decide, you back it up. This is the big idea now around the world with the idea of accountability or privacy by design or the engineering.
It’s this idea of the fully accountable enterprise, be it an intake moment on some channel — at the design stage, you bring all these considerations in the design and you gate it through the implementation, anticipating what you’re trying to accomplish and making sure it’s feature functionality embedded at either intake or then use of the data, knowing that we’re going to speed. The use of the data we’re gaining velocity on and we’re moving towards these real time inflection moments where the data is going to be resident in some system. The patient or the user will interact; you’ll have an algorithm that will pull down the data, do some real time analysis and inform that patient or user experience.
We’re moving to that kind of interaction, so the accountability process has to move way upstream so that you have accounted for potential outcomes, potential harms and benefits way up here in the design before you get to that moment of interaction because we’re moving to a world of algorithmically-driven experiences. And we’re going very quickly to that world, quicker than any of us imagined.
If you could see what I see in my R&D shop where we’re building the things that our brands, including health, are going to offer in 12 months. If you could see what I can see, you would know that we’re going to get to this point a lot faster. We’re moving toward zero latency where I walk in and there’s something about my behavioral fingerprint on my device and I get recognized that way. And things about the apps I’ve been using get pulled in, and it informs my experience in real time. Zero latency. That’s where we’re headed. We’re certainly not there yet, but I think we’re going to get there quicker.
The big idea is it’s not just about privacy, if you think of privacy as transparency into the process and choices, but it’s about the full data protection and data governance, and moving the accountability way up here at the design stage. In a fully sensor-embedded environment, in an observational world, you have got to do that piece of work up here or you won’t have an outcome that any of us like.
MS. BERNSTEIN: Yesterday we heard some talk about the development in the commercial world data segment, what they called data segments — lists of people who have various attributes or have various interests, some of which could be quite sensitive in the health area. So you could have a list of people who are incontinent, people who have HIV, people who are interested in depression, and that sort of thing.
I’m wondering if any of you could talk about how those lists are developed. What are the kinds of data you have, where that comes from, what kind of sources, ones that we may be familiar with or maybe some of the unusual sources, so that we can better understand that business? And how that data is used by your clients — just to give us a better idea.
MS. COLCLASURE: In our business, of course, as I explained at the outset, this is what we would call the data segment idea. It’s a data product where we compile data from the marketplace into a data product. And what you’re talking about, in our vernacular, is marketing data segments where you’re selecting an audience that you’re trying to reach with some sort of message.
In our shop, we differentiate between HIPAA-collected data if it originated under a HIPAA construct — that’s one set of rules. If it did not, then we apply a different rule construct. In our shop, we’re very mindful of the sensitivity. We have on this other side non-HIPAA. This is where most of the data is —
MS. BERNSTEIN: That’s what I’m asking about, the stuff that is not HIPAA, and how that gets to you from the market.
MS. COLCLASURE: There are two flavors. There is what we call, in our vernacular, core data, meaning it’s observed data about a user or a consumer. It may be over the counter purchasing in a retail pharmacy, and we can discern interest from that and we might create an interest category from that. That you can get in the form of a list; you can get it in the form of an enhancement. That’s one flavor that we call core.
MS. BERNSTEIN: It’s an enhancement.
MS. COLCLASURE: It’s where you come to us with your customer list and say — a hospital system might say we’re trying to better understand our patient base and we want to know about their lifestyles, their sociographic, psychographic, demographic, whatever it is, and we would overlay data elements onto their CRM file, their customer file or patient file. And there are rules around how that can and cannot be used.
But to your point specifically, how do we curate or create health segment data, non-HIPAA regulated, it’s from behavior. That’s one, and that’s called core data.
And then the other, and the big one and the future of everything, and we do some of it, is algorithmically determined. So, modeled data. We have a known audience of known folks that, say, spend a lot of money on diabetic supplies at a retail pharmacy, over-the-counter supplies, every month. It doesn’t mean they have diabetes; it might mean they might be interested in it. We would build a look-alike model and we would score the population to say we think that this household looks like other households that have a propensity to be interested in diabetic supplies, so we would put a score.
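A look-alike score of the kind described can be sketched in a few lines. This is a minimal illustration that assumes each household is reduced to a numeric feature vector and that similarity to the average of the known audience is measured by cosine similarity. The actual modeling technique was not specified in the testimony, and all names here are hypothetical.

```python
import math

def centroid(rows):
    """Average feature vector of the known (seed) audience."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lookalike_scores(seed, population):
    """Score every household by how closely it resembles the seed audience."""
    c = centroid(seed)
    return {hid: cosine(feats, c) for hid, feats in population.items()}
```

As in the testimony, a high score only says a household resembles the seed audience. It is a propensity, not a statement that anyone in the household has a condition.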
Now, we believe that the sensitivity is very, very critical to get right, so, for virtually every single data element we have and every model, we classify it. We have a set of prohibited elements. We don’t do mental illness; we don’t do STDs. We don’t do the things that we feel, from a social norm perspective, are too sensitive. I don’t like it.
I’ll tell a little bird walk story. My modeling team — and they are a team of PhD statisticians and they’re amazing — came to me knowing that the health market is white hot. We have a lot of clients that want to connect with that audience.
One of the things we do, in addition to classifying, is elements that are health related we only make available to health clients so they can use the elements to connect with a relevant audience. We wouldn’t provide those elements to, say, a gaming company or a financial service company because they are not in the health vertical. So we’re very careful about the use case to make sure we believe there’s relevance and there are some ethics involved. We’re connecting for a true articulable value with the right audiences.
So here’s my bird walk. My analytics team came in and they were very excited. They had developed a 10,000 audience propensities — a really significant piece of modeling work and amazing precision. I’m like, well, that’s exciting but we’re not going to sell that to the market. I know you did a prototype but we’re not going to sell vaginal itch scores or erectile dysfunction scores. And they said, why not? We think there’s a market opportunity. I’m like, I just think that’s too sensitive, and they said, well, we’re going to escalate on you. We’re going to go to the leadership. And they did.
And I came armed with a piece of paper and I had pulled the scores of all the gentlemen in the room. When I got challenged I said, well, I want to demonstrate the sensitivity and why this is too sensitive for us right now. I don’t think this is fair from a marketing perspective. And let me just read your scores out loud to give you a sense of what we’re talking about. And they were like, you know what? Point made!
So we curate the data from the market —
MS. BERNSTEIN: I am a little worried about the successor of Sheila —
Seriously. If you personally are the gateway, then I’m worried about Sheila’s successor who might not be as sensitive to this kind of stuff as you are. Then what?
MS. COLCLASURE: I appreciate that. The good news is we have a deep bench, and this is a cultural thing. Our program, though we say privacy, our title is the ethical use of data, and we apply the ethical construct. I have a very deep bench; we live by this, we breathe by this. We talk to every client we have about this topic. This is core to what Acxiom does. We bake it into our DNA. And we are at every important think tank group around the world talking about how you get this issue right. Data is powerful. Data can be used for good; it can be used for bad. Let’s make sure we have the controls around it.
MS. BERNSTEIN: So your analysts apparently don’t have it baked in.
Is this something that the company has a policy or something written down, or there’s some standard? You’re not going to give us that standard?
MS. COLCLASURE: We do. We have what we call our classification standard. We have a PIA program. Any innovation at the company has to go through PIA — again, we have the four stakeholders who come together. We surface all the facts and we make a collective judgment. Is it legal, is it just, is it fair, is it the right thing to do? I, as a consumer, or my mother or father, would we like that? Would that be good? And we apply that ethical judgment. We’re a society of humans and we have to apply the human value.
Let me answer one more question because I think this is really important. When we collect data from the marketplace — because this data is powerful and just about every brand you have ever heard of buys data — maybe not for Acxiom although we have a very carefully curated dataset. We go and we identify the source of origination, the provenance of the data, and we identify its permissibility for the downstream use, knowing that when data originates it comes into being with restrictions on it. Every piece of data that we have at Acxiom and every model that we create has some sort of regulation or constriction, permission or prohibition on it.
So we go back when we source data — because we don’t originate it; we source it, and then we compile it and shape it — we identify its source of origination, any permissions and prohibitions and it sticks to the data for the life of the data, and we account for that all the way out to our client’s use of the data. It’s a very rigorous program. I implemented it about eight years ago, and we improved it over time.
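The idea that permissions and prohibitions stick to the data for its whole life can be sketched as a tag carried with every element and inherited by anything derived from it. This is an assumption-laden illustration, not a description of the program discussed: the `Tagged` type, the use labels, and the rule that a derived element allows only what every parent allows are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    """A data element that carries its provenance and permitted uses with it."""
    value: object
    source: str
    permitted_uses: frozenset

def derive(value, first: Tagged, *rest: Tagged) -> Tagged:
    """Build a derived element; it may be used only as every parent allows."""
    parents = (first,) + rest
    uses = frozenset.intersection(*(p.permitted_uses for p in parents))
    source = "+".join(p.source for p in parents)
    return Tagged(value, source, uses)

def check_use(item: Tagged, use: str) -> bool:
    """Gate a downstream use against the permissions attached to the element."""
    return use in item.permitted_uses
```

Taking the intersection is the conservative choice: a model built from two sources never ends up with more permissions than the more restricted source.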
Now that we’re working with a bunch of digital data, we’re applying the very same construct. It’s new to the digital ecosystem because we’re elbowing our way in saying this is the right way to do it.
PARTICIPANT: Kim, do you have anything to add to that?
MS. GRAY: That was pretty comprehensive, so unless you have a specific for what we do —
MS. BERNSTEIN: I think because, in particular, you are squarely in the healthcare space and in many other spaces as well, it might be particularly useful to hear —
MS. GRAY: Well, our data come from different sources, probably some I don’t even know because data is changing all the time, but anything from, say, nursing homes to pharmacies to hospital systems to insurance plans. The data can come from — think of a healthcare organization and we’re gathering data there. I say tongue in cheek there may be sources I don’t even know because, of course, we’re doing things like Fitbits and whatever, and there is self-reporting that goes on.
All the data, however, goes through the de-identification process before it comes to us, so it either goes through our engine, as we like to call our software, placed at the supplier site or with the use of a trusted third party, so everything is coming in that way. And when you talk about segmentation, we don’t segment as it comes in the door.
What we do, like Sheila said, we have teams of people doing different things with the data. Perhaps we have a team that’s working on maybe a research project with some institution here to do some kind of healthcare research on adherence to a particular medication or following the Zika virus, or whatever it happens to be that we might be working. So they’re working on data as they need it from multiple sources as they need it.
The data is limited as to who can access it, obviously, by the role-based types of access. You can think of HIPAA-types of ways of looking at things and we model that. Even though the data is de-identified — and, by the way, we are not required to do that; we’re not a HIPAA-covered entity, and we do it anyway. We’re a business associate much of the time for a small segment of the business, but we believe in that and do it anyway.
We also do predictive modeling. We take looks at the data, analyze the data, see where it might have relevance elsewhere and do modeling.
But one important caveat is we don’t allow our data to ever be used for clinical applications, so back to the concerns about dirty data and whether or not you’ve really got the right person. And you certainly would never want to be making any kinds of healthcare decisions around somebody’s clinical aspect of care using this kind of data. It does have its limitations. De-identification is a wonderful tool but not for everything.
MS. KLOSS: Thank you all so much. I have one question for Micah. You kind of skipped over this wonderful slide on lifecycle management. Is this addressed in any of the articles that you referenced?
DR. ALTMAN: Yes. That, and the full catalog of controls and the suggested grouping of controls based on learning risk and potential downstream harm are all in the papers that are cited in the testimony and were supplied to the committee.
If I may comment on the last question, we have written on some of the ethics in this area, and I think if we’re interested generally in the harms that come to people from learning about their health, whether that’s embarrassment or loss of insurability or something else, it’s important to note that this is not only related to the domain in which the data was collected. We heard earlier this morning that you can use video data to measure heart rate. I used to run a qualitative data archive and we had audio recordings and video recordings. And what we did not realize at the time was you can use audio to diagnose Parkinson’s in some cases, or cell phone locations to look at exercise behavior — not because it contains that information but because it’s in the large context.
So, increasingly, in the big data world there are richer signals. We need people to have the ability to restrict or understand or have insight into not just the data as collected but how it is being used downstream, because those uses change.
MS. KLOSS: That’s a great note to conclude our panel on, for sure.
Thank you so much for being with us and for presenting your testimony and the other references you brought forward for us. We are going to go to public comment at this time.
This is the NCVHS Subcommittee hearing on de-identification of healthcare data. Are there any public comments?
MS. KLOSS: Hearing no public comments, we are adjourned for lunch and the subcommittee will reconvene — or anybody else who wishes to stay as it is public — at 2:00 o’clock.
MS. KLOSS: Let’s reconvene, and I don’t think we need, Rebecca, to do the intros again.
MS. HINES: We do.
MS. KLOSS: All right, then we will do that. This is the continuation of the hearing on De-identification and HIPAA by the Privacy, Confidentiality, and Security Subcommittee, the National Committee on Vital and Health Statistics.
We’ve reached the portion of our meeting where we’re going to review what we’ve learned and map out a path forward. We’ll begin with Committee introductions.
My name is Linda Kloss. I am chair of this Subcommittee, member of the Full Committee, member of the Standards Subcommittee, and I have no conflicts.
DR. MAYS: Vickie Mays, University of California, Los Angeles. I am a member of the Full Committee, Pop, this one, and I chair the Workgroup on Data Access and Use and I have no conflicts.
MR. COUSSOULE: I am Nick Coussoule with BlueCross/BlueShield of Tennessee and member of this committee and the Full Committee, as well as the Standards Subcommittee. I have no conflicts.
MS. MILAM: Sallie Milam, West Virginia Health Care Authority. Member of the Full Committee and this subcommittee. No conflicts.
DR. PHILLIPS: Bob Phillips, Vice President for Research and Policy at the American Board of Family Medicine. Member of the Full Committee, this subcommittee, and the Population Health Subcommittee. No conflicts to report.
DR. RIPPEN: Helga Rippen, Health Science South Carolina, Clemson University, University of South Carolina. I am a member of this committee, the Full Committee, Population Health and the Data Workgroup. I have no conflicts.
MS. KLOSS: Do we have any members of the Full Committee or the Subcommittee on the phone?
DR. STEAD: This is Bill Stead, Vanderbilt University. Member of the Full Committee, co-chair of Pop Health, no conflicts.
MS. KLOSS: Thanks for joining us, Bill.
MS. HINES: Good afternoon. I am Rebecca Hines. I am with CDC/National Center for Health Statistics. I am the Executive Secretary for the Committee.
MS. SEEGER: Rachel Seeger with the Assistant Secretary for Planning and Evaluation.
MS. SANCHES: Linda Sanches, the Office for Civil Rights.
MS. KLOSS: Again, thank you, and we’ve had an amazing two days.
A little change to the agenda for this afternoon on the timing — my fault. I have a 6:40 flight out of Reagan and it’s my last opportunity to get home tonight given my route and plane changes. So with the TSA issues, I’ve been advised that I probably better not sneak out at 5 o’clock like I might have done a year ago. So I think we’re going to shoot for adjourning at 4, and I don’t hear any great unhappiness with that.
Hopefully some who have later flights will have a chance to at least catch up on email if not get on an earlier flight. So I appreciate your consideration in that time change.
And I’d also like to reserve 20 minutes to review the draft agenda of the minimum necessary hearing that the subcommittee’s going to conduct on June 16, especially because we’re all together and it’s kind of precious to have everybody’s eyes on this. We passed that around. You can set it aside for now, but we will just take that up shortly before 4 and use our time together to help us refine that next agenda.
So that roughly gives us a hard hitting hour and a half and I think we can get a lot done in that time if we’re really focused.
What we did when this subcommittee held its hearing on section 1179 last year was we had this couple of hours after and we used it to construct a high level summary report that we would present to the full committee in June, which we’ll want to do, but that technique of pulling that together you can appreciate forced us to kind of consolidate what we’ve learned into some categories and it gave us a takeaway from this meeting that then we can continue to refine.
Actually, that work we did in constructing a PowerPoint through this kind of brainstorming process became the basis for a letter to the Secretary.
So we will find out how far we get. So Rachel and I have — and Maya will be back shortly and she’ll work the PowerPoint on our behalf — but we chunked out seven categories that we would like to pull together what we learned into.
The first one that seemed like a real overarching theme that speaks to why this is an important topic now was the reference that several made to the privacy data collision or however it was phrased. I mean, it was clearly a way to summarize kind of the urgency of this topic and why our timing in considering it is good. So I think we will take that up and kind of envision that as a PowerPoint slide and we’ll work together to pull our thinking on that.
The second chunk that we think we can organize our thoughts into is the current state of de-identification. So we heard about issues on re-identification, we heard about the provider perspective, we certainly heard a full range of what’s working, we heard about the cost of the expert approach. I mean, these are all issues that I think can be bulleted under that.
The third — so, Maya, the first slide will be titled — the third will be developing research. There’s just a lot — we just heard a whole rich set of work going on, not all of which is far enough along to inform us, but which I think we’ll want to capture, kind of catalogue. So that one’s probably going to be spilling over to multiple slides.
The areas in which further guidance were suggested is a need and I think there were a number of pretty specific areas that were suggested to us. Our fifth were use cases where current de-identification holds up, where it doesn’t, and that whole recommendation that came out that seemed to be also recurring. The next one that we had was life cycle management process, the incorporation of more robust process management and process controls.
So I think if we start peeling these back, we may find that we need to break it out differently, but at least that seemed like a starting point for organizing our thinking. Does that make sense as an approach? I just combined — I think the seventh was controls.
MS. HINES: The discussion around what is harm and what is risk, does that go under developing research or should we have like a bucket for questions?
MS. KLOSS: Well, then we had two other categories that weren’t maybe substantive, but what we don’t know enough about — so there may be gaps in the testimony we heard and we need to do some additional research. We thought we wouldn’t do that until we peeled through the rest.
DR. PHILLIPS: Linda, where does this discussion of technology versus policy fit? Is that — it came up yesterday and today.
MS. KLOSS: Yeah, maybe that needs to be a separate category. It was a really important one. Why don’t you put that after current state, the balance of technology and policy?
DR. MAYS: Linda, what about the translation issues? I mean, I think that was critical about this issue of moving things from science to actual practice. I mean, I think there’s a lot of things that may come under that in terms of at least recommendations that I think we —
MS. KLOSS: Does it roll up anywhere? Could it?
DR. MAYS: No, because it’s like, it’s research that’s already done, so I thought about there is not life cycle, it’s not —
MS. KLOSS: It’s getting everything out. Would it be education then, or application?
DR. MAYS: Well, translation into practice. So application, translation to practice, one of those.
DR. RIPPEN: Because there was a gap between what we know versus what is being done.
MS. KLOSS: And then we thought that the final question we’d ask this afternoon is based on pulling this all together. Are we really ready to write a letter or do we have — is there another hearing that we need to have or whatever? And we’ll leave the do we know enough and are we ready until the end.
MS. HINES: The population health subcommittee is my first example of that as being part of this. They went and did an environmental scan because after their last workshop, lo and behold, they discovered there was a whole area they didn’t know anything about.
DR. PHILLIPS: Could I ask one other thing? Where does genomics fit on this list? Because it’s such a different animal and it’s not classified right now.
MS. KLOSS: Well, you know, do you hold it separate or do you thread it through?
MR. COUSSOULE: I might look at that as just a kind of explosion of just different kinds of data availability of which that’s a very big one.
DR. RIPPEN: And then what is defined or not defined, and the implications of that under those 18 identifiers, because that was pretty wild.
PARTICIPANT: Well, just the dispute of whether it falls into the current de-identification standard or not. So maybe there’s a special consideration under each of these.
DR. MAYS: I just want to raise this issue because we’ve kind of pulled out research and to me, the thing that kept going on with difference between what’s research and what’s commercial because the research really does have a lot of protections in it.
MS. KLOSS: Well, what I meant by this and I think that you’re — and maybe there’s a separate discussion around research. What I was thinking is we heard a lot about really good work going on that’s going to inform the de-identification.
PARTICIPANT: Are you talking about data science research?
MS. KLOSS: Yes.
PARTICIPANT: That’s not what Vickie is talking about. You’re talking about research using the data.
DR. MAYS: Right. That’s what I was saying. So she had a different idea about it so that — I was saying that. And then when you do life cycle, I guess I had a broader — is to go up to data stewardship.
MR. COUSSOULE: I think research falls into the use case because it gets into the data consumption side. That’s how the data is going to get used.
The one thing in my mind that might be missing is the consumer’s perspective in this, the individual’s perspective.
DR. RIPPEN: And I think that’s where, also — and again, it fits under many of the sub-bullets, the question of the transparency component, provenance. To some degree, that could be part of the life cycle. Then as we mentioned before, the intent. So all of those kind of nuances, they could actually fit under multiple.
DR. MAYS: Yes, like intent, value, and harm, I thought were significant.
MS. SANCHES: I have a suggestion for clarity. There was lots of conversation over these last two days about de-identification requiring a sort of a suite or a tier, a set of different kinds of methodologies and the current de-identification standard requires stripping and then there are no data stewardship requirements after that.
So it might be useful to think about whether your recommendations are for regulatory changes or more requests for the Secretary to provide more guidance because those are very different kinds of things. So that could be something to think about.
MS. KLOSS: So this is just one way we could proceed to feed the elephant.
Does it make some sense? Does it make sense to try to move things? I mean, I don’t think we want to go back through each set of testimony and —
DR. RIPPEN: What might be helpful, I don’t know, but sometimes what were the key learnings that you think — what are the key messages? And then everyone say what did you hear based on the testimony? And then figure out, did it fit into these buckets also? Might be reinforcing, I don’t know.
MS. SEEGER: I think it would be helpful to walk quickly through the agenda and think of the main takeaways from your notes and then see if we can plug them in. I can go up and work the board.
MS. KLOSS: All right, then. So we’re back to yesterday morning at 9 o’clock, Simson Garfinkel.
Well, one of the key takeaways, the first key, every dataset has different de-identification challenges. Now, if I were using the other frame, I’d put that as the current state of de-identification or I guess it could go under the privacy.
DR. MAYS: Maybe what we can do is go back and forth and have Maya fill this in under the buckets we already have and then if something is different, keep that list and then go back and redo it.
MS. BERNSTEIN: We can do that. If I do it in PowerPoint, it’s just going to get smaller and smaller.
DR. MAYS: No, no, no. Go back and forth. Yeah, and then you can make each one a separate slide.
DR. RIPPEN: I guess maybe it’s the first bullet that you put under, that you kind of called out. I guess I was struck by the fact that although things are de-identified, quote unquote, per the whatever the requirements are, they are really no longer, they can be re-identified in a way that — in a pretty significant way.
So even if you release it, it’s no longer — so actually, Linda, it’s kind of what you were alluding to.
MS. KLOSS: Well, it’s a little different, that each dataset has different challenges.
DR. RIPPEN: Yeah, this is, is de-identification sufficient?
MS. KLOSS: And then that de-identified datasets don’t stay that way.
MS. MILAM: I think we also heard that to try to achieve de-identification, we need both technical and administrative solutions, and we move the needle on both of those depending on the data and the recipient and the scenario.
But we also did hear Brad and Dan speak very strongly to the fact that we don’t really have a good study of safe harbor, where it was ethically conducted to see the efficacy of removal of the identifiers because most of what we have out there was reported directly to the press instead of going through a peer review process, and so there was some discussion of the need to really evaluate the technical aspects of the safe harbor standard.
MS. SANCHES: I was looking at my notes from Simson’s testimony and your questions and there seemed to be a lot of discussion of the need to improve practice and training in de-identification, that there’s a lot of science in this but not much training, and some questions around creating policy incentives to increase the application of better de-identification and perhaps Sallie mentioned maybe tool kits.
I think that came up a few times as to whether there were tools that could be used and how — but then how we would want to assess the tools, how would we know if it was actually doing a good job, those kinds of questions were coming up.
DR. RIPPEN: And building on the tools, I think the whole question of can different tools be made available that also inform what you may or may not lose from a data fidelity perspective.
MS. SANCHES: Yes. There was — it came up a few times that synthetic datasets could be a really good method except we were cautioned that that actually may not really work from a public health perspective, that there were some concerns about the usability of synthetic data which I couldn’t really assess.
DR. RIPPEN: And I guess there’s a differentiation. If one — so here’s an interesting question. So if we really truly de-identify, in the true sense of the word, and again, many people brought up the question of not allowing re-identification, but if we have provenance, we would never really have to really re-identify because you can go back to the source to say that there’s an issue, find it.
So I guess I’d like to separate out the distinction between, oh, I have to know how to re-identify because I want to get that alert versus going back to the people who have the data to — whose role is it?
So there might be a question of how does one address some of those balances because there may be multiple ways. I don’t know if it’s a major theme, but I know that that at least addressed the balance of can I re-identify or should I and what’s the other option?
MS. SEEGER: So back to Linda’s point about the usability of synthetic datasets, there were four points that Cavan made. The semi-trusted analytical sandbox, which is the enclave model, the synthetic private dataset — synthetic differentiality; a synthetic dataset is fine, but these are different points. Not just synthetic datasets, Maya.
The first was the semi-trusted analytic sandbox, which is the enclave. So that’s a separate bullet. Secure multiparty computing and near real-time business data in addition to that provenance. I think the takeaway there is that there are models.
DR. RIPPEN: Of use of data. So I guess all of those would be — so I would say that this might be really around what are options for allowing use of data and mitigating risks of re-identification. So those were the sub-bullets under that. So put that at the top.
MS. KLOSS: Again, in our first panel, we had observations including that the current science is much better than the practice, that we need to focus on spreading the science we have, which I think certainly could lead to a set of recommendations.
DR. RIPPEN: You want to translate the — you want to have —
MS. KLOSS: Well, not only translation and dissemination, we want to spread the science that we have. We need to step — we need to raise the sophistication of current practice with what we know today. There are other new insights coming along, but we don’t have a way.
DR. MAYS: Well, if you talk about it as translation, it fits very well into kind of funding sources and what have you and it can be tacked right onto the individuals that are funded to do this.
MS. KLOSS: Well and I’m thinking not in the research context. I’m thinking in the provider world and what we heard today about how cumbersome it is to just do de-identification in a healthcare organization. Move all the data out of the EHR, move it into a —
MS. HINES: That was something we heard, was that it’s actually the cost benefit analysis for some was articulated as not worth doing the research so there’s actually research not being done as a result of the current practice, is what I heard several people say. Some pretty important research is not occurring because of the fears around risk mitigation and re-identification.
PARTICIPANT: And also there’s not enough workforce.
MS. KLOSS: You just capture these thoughts, Maya, and I’ll put them back in the categories later.
MS. BERNSTEIN: Well, I don’t want to read your mind about where you meant to go, but also, if I keep adding them in the same place or one place where you can see them, it’s going to get too small to see.
So Rebecca’s point was that some research is impeded by the fear of —
PARTICIPANT: The risk as well as the cumbersomeness and the expense and the cost.
PARTICIPANT: And what does cumbersomeness mean?
MS. SANCHES: You know, expensive, administratively difficult.
MS. SEEGER: What I heard is that it’s expensive which I think is different to do the expert determination. They basically said it was too much money and then they didn’t try to do it.
DR. PHILLIPS: It would be expensive. It would cost the research budget if you went with expert determinations.
PARTICIPANT: I heard that the electronic record system was the cumbersome piece of it because they have to export it and manage the information.
PARTICIPANT: Well, what I heard about cumbersomeness was the fact that there were only five people nationally who were experts and it would take months to get them involved in addition to cost.
PARTICIPANT: But cumbersome implies to me that the process is hard when actually they’re saying they don’t have the — they don’t want to hire the expertise in house and it’s expensive to buy it.
MS. SANCHES: But the alternative is cumbersome, which is going out and getting 5,000 or 10,000 individual consent signatures. That’s the part that’s cumbersome, or they can go to an IRB if they can get their IRB to agree.
MS. BERNSTEIN: A lot of times it’s for business purposes.
MS. KLOSS: But we heard just stripping away the 18 fields was a fairly significant process. Extract the dataset from the EHR, put it in, create a new dataset, then strip it. They have been using — I talked to her offline. They typically just use like a SAS program or something like that to strip away the identifier. So this isn’t even getting to the expert approach. This is applying the safe harbor.
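As an illustration of the kind of field stripping described here, and not something presented at the hearing, a minimal sketch might look like the following. It is written in Python rather than SAS, and the field names, the `strip_identifiers` helper, and the sample record are all hypothetical. The set of fields is an illustrative subset, not the full list of 18 safe harbor identifier types.

```python
from typing import Any, Dict, List

# Hypothetical field names: an illustrative subset only, not the full
# list of 18 safe harbor identifier types.
IDENTIFIER_FIELDS = {"name", "ssn", "medical_record_number", "phone", "email"}


def strip_identifiers(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with the listed identifier fields removed."""
    return [
        {field: value for field, value in row.items()
         if field not in IDENTIFIER_FIELDS}
        for row in rows
    ]


# Made-up sample record standing in for an extract from the EHR.
extract = [
    {"name": "Jane Doe", "ssn": "000-00-0000", "sex": "F", "diagnosis": "J45"},
]
print(strip_identifiers(extract))
```

A sketch like this only covers the removal step. The extraction from the EHR and any handling of dates, ages, or geography that the discussion alludes to would be separate work.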
DR. RIPPEN: Yeah, I think the hardest is getting data out. But everyone has to get it out for CDW. So stripping the identifiers. I think the challenge that I have experienced from others is that a lot of the data — some of the attributes have value. So age, for example, right? Sex, and certain things. So to have to figure out what do you have to scramble and what the size is for that population is the expensive part. You know, if you want to keep some of those components.
MR. COUSSOULE: I think I might even back it up just a notch. I think we are getting way down in the weeds of one example, and it may be — I think there are certainly complexities involved in making the data available for research, complexities and barriers involved in that, which could be expensive. It could be time. It could be losing the utility of the data and all of the above.
I think if we start off with that kind of what we heard which is in order to create the ability to use the data, it’s time-consuming, expensive, hard, lack of resources. All those are just components of the fact that it’s hard to create that set of data to do that analysis.
DR. RIPPEN: I guess I just want to be balanced. I think then the other is that even de-identified, there is — you know, we haven’t captured someone else somewhere else, is that the risk of re-identification is real and increasingly a problem and that going back to then — it goes back to the intent of business and the balancing having more than just the technical component but also the procedural business processes of what is allowable or not allowable, you know?
DR. MAYS: I thought that was a big deal, especially when Jacki was talking, because for California there were many of the rules that she had to deal with, and for her, because I was listening; at first I was like, no, it’s not that — and I realized, it’s the procedure by which — you know, she needs I think more guidance about how to be able to do this more efficiently. It’s doable, but that’s I think what she’s looking for is guidance.
MS. HINES: She also said she felt like she needed someone to back her up when she was dealing with other outfits, other companies. They said, well, your standard is more strict than anyone else’s, because there isn’t anything from the federal level to support her. So she was looking for some federal, like, yes, this is appropriate to ask for this language to be added to the business agreement.
DR. RIPPEN: I think we also have to figure out the balance between effort and value, right, and because it’s not that long ago that the costs of actually getting any of this data was at least 100-fold more, and the availability, and this is when we had many of these guidance principles come into play.
So now I guess the question is what’s the balance and what’s considered a high level of effort versus not with the balance of being able to secure and safeguard sensitive information. So again, I think that we have to readjust a little bit kind of where we think things fall as far as what’s burdensome and what’s not.
MS. KLOSS: We had started walking down the — started at the beginning and kind of walking through the testimony. Could we go back to that? So any other takeaways that we haven’t captured from Simson’s testimony?
MR. COUSSOULE: One point that he brought up, I think, was deciding a little bit of what should be private versus public, and I’m trying to think through the discussion. It was more about if we looked to the use of the information. Is it publicly available, or is it more limited use to a well-defined set or partner or set of partners that live with a little more restricted use? I think that does get into a little bit of what’s the rigor by which you need to put into the identification/de-identification process to offset the risk.
He also talked a little bit about the risk — the utility side versus the — which we brought up a little bit earlier, but how much utility you get versus the amount of restrictions you put on.
MS. KLOSS: So we will move into Brad Malin’s testimony. He raised a new point about the special difficulty of de-identifying unstructured data, narrative data, and new issues regarding conversations in portals. That was the deacon — her husband is the deacon of the church example. And the question for us is that’s where risk started.
MS. BERNSTEIN: Do you want to summarize that thought?
MS. KLOSS: I thought I had. The complexity of de-identifying narrative data or unstructured data. He said risk means something different to everyone.
DR. RIPPEN: He also I think talked about contractual agreements.
MS. KLOSS: He did. He posed the question who’s responsible for integrating multiple certifications. He was talking about the expert method. So if you got multiple datasets that have been certified by experts and then you bring those multiple datasets together, how do you integrate when they were done using a different set of techniques?
DR. MAYS: Wasn’t he also the one that wanted to see a coalition? It’s like he’s the one that wants the people to talk across —
MS. HINES: That was Ira. Yeah, Ira has really nice written testimony, and he calls that one scientific discord where you have these two camps and they don’t seem to sit down and talk. Formalists versus the pragmatists, that’s what we’re talking about?
PARTICIPANT: Yes, I think Bradley also has it in his recommendations.
MS. KLOSS: Well, he made — Brad made a number of statements and recommendations around the need for — the observation that there’s currently no oversight and that we need some sort of oversight on de-identified data, that there shouldn’t be full exemption from HIPAA once something is de-identified.
PARTICIPANT: I believe he said it was ludicrous. That was in fact the word that he used.
PARTICIPANT: So we’ll have the ludicrous recommendation.
MS. SANCHES: I think he and someone else — I can’t remember now who — but a couple of people talked about the need for these things to be time-limited, both in terms of saying that a dataset is de-identified and also from a regulatory perspective that guidance should come out regularly as technology and risks are changing.
MS. BERNSTEIN: You are talking about time limitations for a particular expert’s certification?
PARTICIPANT: Yes. Or even, I think, safe harbor. I’m not sure.
DR. MAYS: Bradley also alluded to the role of the IRBs, and so I think throughout this we probably should have a sense of when they should be a part of this, because otherwise we are going to find everything else will move along, and they are going to keep stymying this. So I think one of the things we need to think about is IRB in terms of education and training as well around privacy, because that may be part of what is problematic in terms of some of the science stuff, too.
MS. SEEGER: Brad said that IRBs don’t feel comfortable because there is no guidance.
MS. SANCHES: He and Daniel, I think, talked about needing some sort of agreement on what is small risk.
MS. KLOSS: And also the need for a clearinghouse for best practices in de-identification.
DR. RIPPEN: And also contracts, the question of some of this could be contractually required so that now you have made it civil instead of just as far as violation.
MS. KLOSS: That’s getting into Ira’s. So, Ira had the whole set of recommendations on data release policy and process for minimizing risk, similar to data security policy.
DR. PHILLIPS: He also talked about the statistical disclosure limitation tiers that came up again today. I actually heard someone talking about query-based systems today, mentioned by Daniel. But different access levels to the data.
DR. MAYS: I can’t remember if it was him or before him that just reminded me when you said it. There was the suggestion about we have no standardization about suppression. So one dataset, it may be 3. Another dataset, he said, might be 5. It could depend on the data, but I think better guidance about suppression, cell suppression, could be useful as well. I’m not sure exactly under which that goes, but —
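For readers unfamiliar with cell suppression, the threshold idea mentioned here can be sketched as follows. This is an illustration only, not something shown at the hearing. The `suppress_small_cells` function and the sample counts are hypothetical, the thresholds 3 and 5 simply echo the figures mentioned above, and leaving zero counts unsuppressed is an assumption that a real policy might not share.

```python
from typing import Dict, Optional


def suppress_small_cells(
    counts: Dict[str, int], threshold: int
) -> Dict[str, Optional[int]]:
    """Replace any nonzero count below `threshold` with None (suppressed).

    Assumption: zero counts are left visible. Some policies suppress them too.
    """
    return {
        cell: (None if 0 < n < threshold else n)
        for cell, n in counts.items()
    }


# Made-up table of counts per cell.
counts = {"cell_a": 12, "cell_b": 4, "cell_c": 2, "cell_d": 0}
print(suppress_small_cells(counts, 3))  # only cell_c is suppressed
print(suppress_small_cells(counts, 5))  # cell_b and cell_c are suppressed
```

The same table yields different published cells under the two thresholds, which is the lack of standardization being described.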
MS. BERNSTEIN: I am just going to put the first panel together, because it doesn’t matter really where it came from as long as we sort of know.
MS. SANCHES: I am not sure if you already have this, but what I find sort of remarkable is that Ira talked about the four tenets of reasonable data security. Acxiom has their own tenets. HITRUST also had a different way. So they didn’t use the same language and I’m sure their processes are different, but they all had developed and were using a particular way of examining risk, value, and minimizing risk.
MR. COUSSOULE: I think that the challenge with that, and I think we have heard from each of them that all sound like they are doing good stuff, but the point brought up is, okay, where is that actually codified in any kind of policy. So now all of a sudden we’re in economic risk as a company, so now we are going to sell this data we didn’t used to sell before, because policy has now become inconvenient. I’m not saying that they would do something like that, but there is a very different risk depending on who is sitting in what chair and what the stressors are, because there is no guidance that says you can or can’t or should or shouldn’t.
To me that’s the policy implication of that one. If you are depending on people being ethical, and the vast majority of people will be and they will make good decisions, but then we all know the stressors of running a business and how that might work and then you start stretching that boundary a little bit, and without some clearer guidance around good practices or risks of doing that, I think that’s where the challenge comes in, risks, consequences.
MS. SANCHES: I believe Brad said something on that point, developing economic incentives to limit risk through deterrence. That’s how we started talking about data enclaves.
MS. SEEGER: I think we heard that from a couple of different people.
MS. BERNSTEIN: Basically they were talking about the shift of a focus from preventing harm to minimizing risk, but harms are hard to measure. They differ among — depending on an individual’s particular proclivities, what’s harm to me might not be harm to you. Although I think they wanted research on that, didn’t they? But the idea that policy should be — it’s hard to make policy on that for those reasons and the idea — I’m just looking through my notes of — you know, to focus on the process of minimizing risk, knowing that it’s never going to be zero, rather than trying to prevent harms which are hard to measure and differ for personal — you know, by individual.
DR. RIPPEN: And then I know this was just brought up, was the question of not allowing to re-identify, and I know there’s some — I mean, I think it’s not popular, but it’s a question of the guidance is not re-identify or link and they have to do security.
MS. BERNSTEIN: You mean putting into your data use agreement something that says you may not re-identify?
PARTICIPANT: You know, I think it’s important to differentiate when we think about the risks from data that is accompanied by a data use agreement and data that’s just out on the web, because when we hear about the risks of these data in terms of the technical de-identification, I think it’s for just data out on the web. I don’t think we’re hearing — that’s not what you heard?
DR. RIPPEN: No, because actually we require that any data use agreement, that no one is allowed to re-identify, because sometimes when you do the risk, especially if you do statistical de-identification, you have to assume that they are not going to try to re-identify. So it actually is — I think it happens on both sides.
MS. MILAM: Maybe I am not being clear. Maybe we just see it differently. I guess what I’m saying is I think a lot of the data that’s shared that is a limited dataset or could be de-identified, that is accompanied by a data use agreement. Those data use agreements usually have prohibitions on re-disclosure and prohibitions on re-identification.
MS. BERNSTEIN: There’s nothing that requires them to have that, though. So it’s only the good graces of your paying attention chief privacy officer that put those in those agreements.
MS. MILAM: But the concerns I think mostly that we are hearing where we don’t have that administrative control on their re-disclosure and the re-identification is generally data on the web, because we can — if we choose to, we can control that by a data use agreement. So I think — we can control the risk of re-identification with a data use agreement to a large extent.
So I think when we hear the concern, the stories where reputation has been eroded where people have information that was shared that shouldn’t be shared, I think it’s mostly from data on the web that is not accompanied by a good data use agreement.
DR. MAYS: Sallie, I think you are making the assumption about what the data use agreements are like, and it’s like just this meeting alone I’ve learned so much from you all, and I can’t tell you how many data use agreements I have signed to get data from entities, these national datasets. A lot of the things that have been discussed here today do not appear in those data use agreements, and these are major entities. I’m not going to call names, because we are live. But they are major entities.
The issue is we are usually told as researchers that if we identify anyone, we just have to let them know, but combining data, I have no restrictions for some of that. So I can take those datasets and I can do other things that would, from what I have learned today, increase risk.
MR. COUSSOULE: I think if we took a step back and said that you can clearly build more constraints and restrictions and protections when there is a data use agreement, they are not always as disciplined and diligent as they could be, but there’s a much higher risk when there isn’t a data use agreement.
DR. MAYS: I think that is what we need to recommend is that I think you all don’t realize how there’s very few people who are as knowledgeable about this as you think, and it may be this is what we need to do is talk about best practices in building a data use agreement that would prevent these things that we are concerned about and say specifically some of the things that should be in it.
MS. KLOSS: One of the other I think really important points made was Daniel mentioned that it’s always a tradeoff between information quality and privacy protection, and we just — that kind of goes back to the whole data management, but I don’t think we think of it. We don’t think of that tradeoff so much.
MS. HINES: Going back several points ago, yesterday Barbara had talked about for research purposes when clinical data are de-identified and then you find something and you need to let the patient know, for instance, that they have cancer, Bill Stead mentioned in an email that where he works they do not re-identify the data under those circumstances. Instead they use a classification algorithm developed through use of de-identified data, apply it to all the identified records, and then contact everyone who came up positive to come in for clinical testing. So just want to make sure that’s acknowledged, as well.
MR. COUSSOULE: That is actually not an uncommon model, because it’s one thing to say I’m going to target that one individual or that may have been a factor, but there could be lots of others. So I think that model is in use. We have done similar kinds of things.
MS. KLOSS: And that protects the intent of de-identification, too. We also heard that what — we need a current definition, a workable definition, of what re-identification is.
MS. MILAM: Somebody talked about the difference of just matching uniques versus correctly matching unique to the correct individual. Somebody spoke to that.
DR. MAYS: Somebody also talked about the notion of applying system-type approaches to privacy issues. Treat it as a system problem rather than an individual.
MR. COUSSOULE: One other topic that came up was the distinction between the use and access of the data by covered entity versus a non-covered entity. To me, that gets into the policy implications of saying who is covered and who is not and what rules do you have to follow with regards to how to protect data. Right, because if you de-identify the dataset, there’s no restrictions on what can happen to it then.
DR. MAYS: The data that comes out of covered entities maybe should be revisited annually. So in terms of looking at the agreements, making sure that they work.
MS. BERNSTEIN: We have people on the phone, including Gail, our staff member Gail Horlick from the CDC, who wrote me a note saying she’d been listening to the whole hearing, and they can turn up their volume and make us get recorded, but the people on the phone are going to have a hard time hearing if you don’t talk into the mics.
It occurs to me, if you want to pause and see if any of our listeners are — while you’re thinking.
MS. KLOSS: Bill, any — or others on the phone? Any comments or questions as we continue through the learnings?
DR. HORLICK: This is Gail. No, everything that’s been stated are things that I heard, for as much of the hearing as I could attend. Thank you.
DR. STEAD: What you are saying computes from my perch. I particularly like what I see as a framework of guidance, such as data use definition, regulatory change such as time limited safe harbor oversight and evaluation and option for hearing with mitigation. I think you are on a good track.
DR. MAYS: One of the other things that came up is that there were no standards on the monitoring of apps, and that was when — and I think we have talked about that in the data use and access group in the sense of are we going to at all think about apps and some of the issues with the apps. So that came up here as well.
MR. COUSSOULE: I guess the only other comment — and it’s similar to the last one there as far as the different rules and Brad made the comment directly, but the differentially regulated environments for exactly the same data. So it’s just another — it’s a slightly different twist on the same topic, again whether you are a HIPAA-covered entity or not a HIPAA-covered entity. You have different sets of rules, and maybe they are actually in conflict in some cases or certainly not at the same level of rigor.
I don’t know that we have enough information to know where all of those are, but can certainly comment on the differences and potential implications of those differences.
MS. BERNSTEIN: So you are saying that depending on the environment that the same data might find itself in.
MR. COUSSOULE: He talked about that the laws are generally sector based, that we have in place, and that if you go into different sectors, you conceivably could have virtually the same information not treated the same way. Same level of controls.
MS. KLOSS: And we also in that first panel discussed — began to discuss the specific considerations relating to social media and genetics.
MS. SANCHES: I think also the first panel was the beginning of the tension we heard throughout the first day around whether de-identification is worth doing given changes in technology versus others who thought it was actually useful and versus those who think it’s useful and we should do it much more and in a rigorous way.
MS. KLOSS: I think what I heard was useful but not sufficient.
MR. COUSSOULE: I don’t think anybody said it was worthless, but they definitely said if you expect it to be the panacea that it’s not.
DR. RIPPEN: I think that what maybe the message was that also then the solutions might be such that you would have maybe things not de-identified in the same way, but you would have other restrictions.
MS. KLOSS: Let’s move on to the second panel. Clearly we will have many more passes through this, but here we had introduced the full concept of citizen scientists, not covered by HIPAA, and new models.
DR. RIPPEN: There was the theme of the ability for control, how and when information would be used, which actually touched upon some high level things that were covered throughout the two days around multistep process for consent, which was then, you know, as the overarching theme.
DR. PHILLIPS: I thought Michelle also flipped that to say that individuals also want more control and use of their data because they want access to it. They don’t feel like they have access to it as it is. She talked about the need for flexible data sharing protocols but with government oversight and vigilance about new ways of identifying. It was a need for more authority to control use of their data, not just to protect it but to actually access and share it. It was not what I was expecting to hear.
MS. SEEGER: This was the CDT’s testimony. So it was speaking to patient-centered SAS and freeing data so that patients could have more control over it. It was a little off topic to de-identification.
MS. BERNSTEIN: Maybe. If they — I can see how to connect something like that, if you have more control over your data. If you are allowing it to be de-identified and can’t ever get back results from — sort of seems like it’s sort of related to some of the stuff we did back with data stewardship where the idea of what are you feeding back to the community that’s giving you the data? Are you giving them insights from the data that they have provided to you so that you can publish your paper and get your tenure, right?
The sort of helicopter idea that we heard about that it would be nice to have some sort of kind of feedback loop where citizens or subjects can get back either individually identifiable data or at least results of what happened with the data that they contributed.
MS. KLOSS: She also suggested that we need more robust risk assessment tools and process to be applied to analytic databases.
MS. BERNSTEIN: I was struck by how many different people mentioned — I mean, maybe I wasn’t struck by it, but many different people did mention about the possibility of analytics to reproduce existing patterns of discrimination or — so this is something that comes up as a theme in other places too that’s been reported, but it came up as a theme in multiple different pieces of testimony that we heard over the last couple of days. Hidden bias. Yeah, hidden biases, but also the idea of propagating biases. Both of those things, thanks.
MS. KLOSS: We will have to have some kind of theme or discussion around governance and ethics. We heard a lot about ethics. I found that really encouraging, because at the end of the day, that can be the touchstone.
MS. SANCHES: Jules talked about the fact that we are bad at talking about risk and we need to look at what value we want to give to controls. Do we care about statistical risk of identification, or do we care more about the risk of harm? And that was sort of the basis of his discussion of need for more ethical discussion. Do we care about the statistical risk of identification or do we care about harm, and how do we talk about harm? How do we talk about risk?
DR. RIPPEN: And that was in the context of the value.
MS. BERNSTEIN: You mean like our ethics? Our values as a society?
DR. RIPPEN: I think of the value to society, the value of what it is that you’re trying to do. So it touches, you know, harm but actually the benefits. It’s the other side is how I interpreted it.
MS. SEEGER: I took away a couple of other points during this panel. Technology alone isn’t the solution. Then there was a whole discussion around best practices starting with privacy by design, collecting only the data that’s needed, disposing of data as it becomes less useful. That’s kind of lifecycle stuff we have been hearing about.
MS. SANCHES: One theme we kept hearing was about auditing technologies, metadata.
MS. SEEGER: Contractual limits, contractual solutions and auditing.
PARTICIPANT: Well, and just how difficult it is.
DR. MAYS: I also thought the discussion that we had about harm in the second one, especially when they were talking about different types of harm, I think Sallie brought it up, they discussed it, some of the distinctions between harm that had to do with reputational harm, harm that had to do with identity. It would help us when we talk about risk to really try to get a sense of what type of harm we’re talking about.
DR. PHILLIPS: Cora Han had talked about, in her fair information practices recommendations, she talked about record linkage and mentioned the potential for FTC’s rules for data linkage to apply to HIPAA data. I wasn’t able to capture all the content of that. I didn’t know if it meant that FTC had some policies that might be applicable here or —
MS. SANCHES: The FTC in their report on the internet of things had made a couple of suggestions, and I’m not an expert in this, but from my notes, it called on entities to take reasonable steps to de-identify, to make a public commitment to not re-identify, and to enter into enforceable contracts to not re-identify. But this is just based on my notes.
MS. BERNSTEIN: Yeah, I have that the FTC test is that things are not reasonably linkable if you have put in place reasonable measures, if you commit to not re-identification, if you prohibit downstream re-identification, so that’s your subcontractors, basically, or anyone else you disclose to.
MS. KLOSS: She did indicate they were working on some kind of research around ethical practices also, and then there were a whole set of recommendations that came from the PCAST report, which I don’t think we need to reiterate on these slides, but will need to return to.
MS. BERNSTEIN: They went back to the White House. They did a report in talking about respect for context that came up there. So the fair information principles, but sort of added this idea of what a consumer can expect based on when, how, where the data was originally collected. So consumers have a right to believe that data will be collected, used, and disclosed only within the context where it was originally collected.
But within that context, a collector should be kind of free to use within that context, and then if they move outside that context, that might need another consent or some other method for moving outside that context using data differently.
MS. SEEGER: Both FTC and PCAST called for greater consumer transparency, and FTC went so far in their recommendations to suggest opt-in or opt-out.
DR. PHILLIPS: This is the panel where -omic data — phenomic, genomic data were as — being so specific that if it’s linked, you lose all anonymity. The other thing they talked about was — this is the first place that HITRUST came up, the de-identification framework that HITRUST has developed might be a next model for us, though I am concerned about its sophistication after today’s conversation.
DR. RIPPEN: One of the action items that we may want to consider is to kind of crosswalk the kind of different best practices just to kind of get a sense.
MR. COUSSOULE: One other topic that came up yesterday was the — when we talk about harm, the definition, but the distinction between real and perceived. Just because you have been able to re-identify something doesn’t mean you have used it in a malicious way or any harmful way. But the question about — gets into who gets punished, right? So if I am a data disseminator, do I get punished, or I’m the one that actually created the potential or real bad event, bad action.
DR. MAYS: Also on that panel, this is where the issue about machine learning and algorithms came up. I think this is where it gets to, Sallie, the point about the agreements and I don’t think a lot of people understand some of the ways in which machine learning algorithms can end up identifying a person. So that should be written into those agreements.
MS. KLOSS: That was this morning. I have moved on to panel three this morning, and I’m looking through my notes here.
I thought that concept of separating the data from the uses, again, that’s been very recurrent, but the problem with de-identification is it assumes that you can just do something with the data set and that will persist.
MR. COUSSOULE: Can I go back to one other point yesterday? I’m just going back through my notes, and one that I think — I’m sorry, I don’t mean to be bouncing back and forth, but there’s one point that came up was when we talked about the kind of consumer rights side of that one and that if we were to frame it as when do we need to notify individuals versus when do we not need to notify individuals that their information is being used and is there a way to frame that up into policy guidance about these are what most people would consider rational uses so they’re okay versus they wouldn’t be okay.
I don’t know if there’s a way to frame that up or not, but I think that’s an interesting topic that came up to try to get around a little bit the — by the way, I have to go to every individual for every single thing I want to do with the data. Again, I’m not sure that — that did come up. That was sort of danced around, not the right choice of words, not that people were trying to get around it, but it was on the periphery of several conversations yesterday and this morning.
DR. RIPPEN: I think it also may be related to some degree by the multistep process for consenting also.
MS. KLOSS: Recognize what can’t be protected by technology. So I think there was — yes, machine learning has great promise, but technology has limits.
MS. HINES: Yes, Vitaly said important to understand limits of technology was how he summarized that.
DR. RIPPEN: And then the higher level of process of using the data in a privacy preserving way.
DR. PHILLIPS: I thought Vitaly’s point about technology moving faster than laws and policies and the need to have an evolutionary policy process so that you’re really trying to keep up with technology and harmonize was very eloquently said.
I’m losing batteries, so I’m losing notes.
DR. PHILLIPS: One of the things that Jacki said that I thought up front was important was that de-identification conflicts with data exchange regulations. So they are constantly having to negotiate that tension. But I thought most of her conversation was just so eloquent about the difficulty they have of managing. They want clear guidance. They want clear policies. They want use cases to help them so they can understand the policies because on a daily basis they are just struggling to manage the requirements.
MS. KLOSS: And she said that patients are comfortable with the use of their data when they understand.
MS. HINES: We may already have this, but this was where the discussion went about vendors and subcontractors. There is no audit process. She has 5,000 vendors. There’s language in the business associates agreements, but they push back because they are pretty — she tries to be pretty rigorous, and there’s no federal backup on that, on having some rigor in language in these business associates agreements.
MS. BERNSTEIN: What she had was a term that prohibited the business associates’ use of their de-identified information. She has a provision. She said she has a provision in her contracts that prohibits her downstream subcontractors from using de-identified information. But then she doesn’t really have the resources to audit.
MS. HINES: She also said that some of them would push back because what she was — the standard she was using is higher than pretty much whatever else is out in the public sphere, because there was no federal guidance. That’s where she said guidance and policy is needed.
MS. MILAM: I think what we are seeing today, all the business associate agreements that come across my desk now have provisions for the vendor to use their de-identified information, which we refuse to allow them to do, but it’s really become a business line for pretty much every vendor.
MS. KLOSS: But even if they don’t have a use for it today, they want the right in case that product opportunity may come later.
MS. SANCHES: She also raised something talking about weighing the benefits and concerns around re-identification and that there maybe should be some step of looking at whether there is a benefit to the patient and if so allowing re-identification, but that she hasn’t found any guidance on this.
MR. COUSSOULE: I think that falls right into what Bill commented on that you brought up before in regards to do I go right at the data source that I created the insight from, or do I then take that insight and then go back to the raw data if you will from there and try to figure it out.
MS. HINES: I don’t know whether she did slides or whether it’s in her testimony; I don’t have it at my fingertips, but her four main categories I thought were really clear. She went through the current status of de-identification I think really eloquently about how the expert method is very expensive, the 18 identifiers are limited and this is where she brought up the use cases with Q&As for pop health, for precision medicine, for these different areas and needing to have room for innovation. So I thought she was very succinct and clear in how she laid that out.
DR. RIPPEN: And also very strong advocate for transparency to patients. Patients were her number one.
MS. KLOSS: Dr. Capps, he was the top of your list, the privacy and big data and collision. The notion of not keeping the identifiers with the data but using pseudo ID. And the common privacy architecture.
DR. RIPPEN: Given that de-identified data doesn’t fall under HIPAA, per se, it may be wise to think about how other government agencies address privacy of data and its de-identification also, because it goes back to consistency and especially given that all this data is now being linked for lots of different purposes, it’s just something to consider.
MS. KLOSS: Out of this panel we discussed more the layered or tiered approach.
DR. HORLICK: This is Gail. I just wanted to make a comment that this is something how the government agencies handle it that I get a lot of questions on, and so I think that the people are used to the concepts of cell size when they publish in something, but a lot of them are people just come to me, like in our clearance office, with a judgment call like if we publish this in the MMWR about a 7-year-old, you know, is this — how do you put enough information for the scientist or the doctor and still protect privacy, and a lot of times, I’ll make a suggestion, but I know that’s because it’s going through my office and I get asked a lot.
There was one where I really couldn’t identify it and then I saw — so it wasn’t even using a computer to link it. It was just, you know, when we do that. I don’t think there’s a consistent way at CDC. I think they have guidelines for what they publish and release in terms of sizes, but when it comes to so many other instances, I am repeating a lot of the best practices that I have developed, heard over the years. So I just think that might be a little bit representative.
DR. RIPPEN: And just think about the first person who contracts Zika in the United States with a mosquito that’s U.S.-based.
DR. HORLICK: Oh, we have a Zika registry at CDC now, but they went ahead and we — somebody I work with basically suggested they get an assurance of confidentiality. They didn’t even know what that was.
MS. BERNSTEIN: I think it is often the case in the United States that the first case of whatever it is, we know who the first Ebola patient was in the United States, and then there are several after that. We might not know who they were. But it may happen again with Zika, with resistant TB, and so forth.
DR. RIPPEN: There is a nuance also between what is transmittable by an individual versus not, too. So there’s some interesting nuances too.
DR. HORLICK: I would love to see — I mean, I sort of if I do a PowerPoint I’ll put up some bullets and make them think, but something from a national advisory committee that you could point to would be really helpful.
MS. KLOSS: We heard — of course we had lots of good discussion about genetic information, and I thought that insight that we don’t know what’s sensitive in it was really — yeah.
MS. BERNSTEIN: Do you mean that we don’t know what we will discover next year?
MS. KLOSS: Yes.
DR. RIPPEN: What about forensics, right? You know, if you have a sequence, you know color hair, eyes, et cetera.
MS. BERNSTEIN: No, but I mean he was making the point that we have these — we don’t know what we might discover five years from now that is in the genetic data that we have or genomic data that we have now that we can’t really make rules for how to protect it. So I guess he was saying we have to revisit this on a regular basis.
MR. COUSSOULE: That got into a little bit of the when you de-identify something. So again, we won’t know tomorrow. We’ll know tomorrow more than we’ll know today, but it will be different.
MS. BERNSTEIN: It also impacts more than one person.
MR. COUSSOULE: So I think the question about the policy guidance in my mind there is it has to anticipate that there will be more ways of identifying going forward and not fewer, and it will be more complicated. We are not doing a static analysis here, right? We are doing a point in time but recognizing that things will change. So make sure that you don’t set something up that looks really good in the rearview mirror but doesn’t look good looking out the front window, right?
MS. BERNSTEIN: Yeah, also I thought it was interesting that yesterday we heard it’s really hard to identify data and then we heard it’s not that difficult. Although, I have to say Dr. Erlich’s testimony, he wasn’t talking about — Linda and I were having a little conversation about this. He wasn’t talking about de-identified data of any sort. He was talking about stuff that people had disclosed that was in fact identifiable, meant to be identified, and how they were using it.
But I thought it was interesting how he talked about extracting things that are not in the safe — that are outside the safe harbor, a list of — you know, so if you took three ages and states in one family, the pattern of my 83-year-old mother in Albuquerque, me in D.C., and my younger brother in Oceanside, California, that group together, if all you knew was ages and those three states, could form a pattern that would identify my family, and maybe uniquely.
MS. HINES: Well, and then there was the surname and the Y chromosome analysis. So I think that gets to the — and what was the term for indirect or release of data where you could impute someone’s identification if you had their surname. Inference. It seems like that — I don’t know if that would be the purview of this committee, but it seemed like he was pointing to that becoming in the next decade could become an issue.
MS. BERNSTEIN: I think he was pointing out that those things that people are putting out voluntarily into the public sphere can be used by others to re-identify datasets that we think are de-identified, because we can use these other datasets in very creative ways.
MS. HINES: Right, but his point was it could be a family member four times removed that put it out. I didn’t put it out, but a tenth cousin put it out, and somehow you can still identify me through linking it to some other source or surname or something.
MS. KLOSS: I think we have an immediate issue. These data are going to become part of health records, and this is an area where guidance sooner rather than later will be helpful. That’s a good bullet point, Maya, I like that. Genomic data is part of EHRs. So it’s not going to really work to take off the 18 safe harbor.
MR. COUSSOULE: It raises an interesting question, at the risk of sounding maudlin, to some degree a so what. So if somebody is going to personally put out that kind of information and then all of a sudden it can now potentially create a bad scenario for a relative, how do you deal with that from a regulation perspective? I’m not sure that’s somewhere we really want to go down.
MS. KLOSS: That is why I was pulling it back to genomic data in EHR.
DR. RIPPEN: I would say that there is a nuance. I don’t think we can do anything about it or should. I think though making sure the people are educated to the implications and, again, going back to Q&A. Why does this matter or doesn’t? What are the implications is an important component.
MR. COUSSOULE: I think the risk that was already brought up which is the fact is there is more and more information that will be publicly available regardless of how it got there that would need to be considered with the datasets that you think are now de-identified.
MS. HINES: Wow, we should capture that. That’s like a perfect statement for the issue. Got that?
DR. MAYS: I also think that the issue of kind of building on what you were saying, the issue of education, whether or not that’s under our purview or not is that as we — we need to educate the consumers in particular about what happens in terms of the lifecycle of their data. So my concern is that when they say yes to something, they go to ancestry.com, they go to all these places, that we need to have education for them about what the possibilities are of re-identification with the possibilities or for ways in which the data can be — you know, like in the data brokers report. The ways in which their data can then be mashed up.
MS. MILAM: Is that an appropriate use for a privacy notice? Get beyond the regulatory and more into the educational function of it?
MS. SANCHES: That actually is something the FTC and the Office of the National Coordinator for Health IT are working on. HIPAA does not go to the actions of individual consumers.
PARTICIPANT: But this was an interesting thread related to education was the whole — importance of having patient advisory boards for de-identification, decision-making, really focusing on transparency, creating sort of consent processes that are illuminating. So there’s sort of a thread in there about involving patients that does not actually exist currently in the rule.
DR. PHILLIPS: He did talk about after the fact, though, that we should have at least trusted data sharing relationships with control, that patients should be able to opt out their data if at some point down the road they don’t like how it’s being used. He was talking specifically about genomics.
DR. MAYS: See, this is where also going back to the IRB and having the same kind of protections there you might want to have a patient involved in the IRB about things like opting in, opting out, because typically what we do now is we put a very broad statement that allows us to use the data to do other research, but it might be that as you come back, a patient sitting there saying no, that’s not good. So again, I think the IRB issue is coming up.
DR. RIPPEN: I think it’s kind of a broader than just IRB, because as we heard, research is pretty broad. There’s the traditional academic, but then there’s also the private sector research, and again, data mining and data massaging, and again going back to best practices and what would be an equivalent of assessing risk and benefit and what’s appropriate or not, the ethical constraints, and making sure they are aligned even with IRB, right? Because in theory, the ultimate aim is supposed to be the same to some degree, right? Beneficial. So that might be an interesting thing to think about, too.
MS. KLOSS: I think that the other — I guess I’m in panel 4, but —
MS. BERNSTEIN: Is there anything else from panel 3 that anyone else has before we move?
MS. KLOSS: The model that Dr. Altman shared, the framework model, the way of relating.
DR. RIPPEN: Catalogue of privacy controls was really useful in calibrating of controls. I felt like that brought a lot together. This was all part of the lifecycle approach to data management and select appropriate controls at each stage.
MS. SANCHES: I think Dr. Curtis talked about how the expert determination method had too much ambiguity, costs, and potential risks, and that’s why they use the safe harbor method, along with security measures.
MS. KLOSS: So, let’s focus on 4 and pull out our pearls from 4 and then talk about where we go next. I think your last comment, Linda, was Dr. Curtis in panel 3.
MS. MILAM: So in 4, I think we heard a lot about different standards and crosswalks to these different standards and how useful they can be with looking, managing risk, looking at de-identification, auditing, and contracting and a lot of other uses in privacy.
MS. BERNSTEIN: Are you suggesting that this group should make a crosswalk among different standards?
MS. MILAM: No, we heard that they exist, and we heard that like Kim Gray is going to send us one where — well, Sheila Colclasure is sending us a mapping of NIST and HITRUST, and Kim Gray was going to send us the background on HITRUST, which takes the requirements from — she said, HIPAA, HITECH, PCI, and I didn’t catch all of the others.
MR. COUSSOULE: Are you all familiar with HITRUST and what they do in the security realm? I mean, that’s really what they have done. So they tried to synthesize a number of different security regulations because — we have the same issues today that we have to live with PCI regulations, with HIPAA regulations, with audit rule regulations, et cetera. So they have tried to create effectively a crosswalk so that you don’t have nine different ways to respond and nine different kinds of audits and issues that way to try to synthesize that into one common framework.
So they have tried to do that for security. It overlaps with a lot of the stuff that NIST has done, et cetera, but they are trying to create the same kind of model for privacy in this sense, and there are in my mind a number of real challenges with that, but without getting too far into it, the idea of having a framework by which you can kind of go by I think is a pretty good idea, and that came out, but then how do you get that implemented across a large system with very different kinds of players? Becomes a big challenge.
So to get back to the security side, there isn’t very prescriptive policy guidelines to thou shalt do this, but they have best practices that you should follow, and by the way, if you’re bigger and have more resources we expect you to do more of them. So it gets a little bit fuzzy from an actual policy guidance and then an enforcement perspective, and it’s not kind of a body recognized by the federal government as a standard that’s accepted. I’m not saying it’s bad. I’m saying there are some challenges with the model that we didn’t really get too far into today.
MS. MILAM: I think we heard some values placed on consumer expectations and I thought it was really interesting what Acxiom looks at in terms of a privacy impact assessment. They look to see if it’s legal, just, and fair, and when I think about consumer expectations, that could probably describe that. When somebody thinks about what is an appropriate use of their data, if it’s legal, just, and fair, you would know that they wouldn’t be surprised by it. You would know that they wouldn’t be upset by it generally. So that might be a lens to consider.
DR. RIPPEN: It was interesting who was on the group, because there were more technical than not. The other thing that I think they also did was reinforce kind of what the other speakers had kind of alluded to, that it’s more than just technical. It was policy and administrative, too. So you need more than one.
MS. KLOSS: Sheila called it in this context governance and accountability against a set of social values. I thought that was very clear.
MS. SANCHES: Dr. Altman had some interesting recommendations to help those who were not — he said were not going to have a PhD on staff, and he was suggesting that there be provided a catalogue of controls, standardized licensing agreements regarding the reuse of data, build in expert panels to develop guidance, and design for interoperability. The problems of applying de-identification to sharing of data in an interoperable system came up several times.
MS. KLOSS: One more scan across the table. Any other words that need to get up there in this first round?
DR. RIPPEN: I don’t know how relevant this is, but it is actually very interesting observation, the shift between what the — where the data source is from. Now it’s going to be 90 percent observational, and that has some pretty interesting privacy nuances potentially associated with it. It’s a social — it’s actually all these things with us, with our kids, with this, with that, that are health-related but also behavior related. That’s where most of the data people are going to be mining is going to come from.
MR. COUSSOULE: One other point I brought up or took away from this morning; when we talk about the controls, that calibrating the controls would matter based on either it’s a different level of rigor based on risk or the potential level of harm, and then how you do that. So it’s not a one size fits all kind of model.
MS. KLOSS: Okay, so what do you want to do with all that? I am thinking Maya and Rachel and I could do that offline. We’ll take this and try to group it by theme.
MS. BERNSTEIN: You want us preparing a summary for the full committee meeting in June?
MS. KLOSS: Well, we will, I presume, have to do that for our committee report.
MS. HINES: I am trying to remember. I think our due date is the 7th for materials to make it into the agenda book. If you can’t make it by then, that’s fine. We’ll just hand it out on the day of. We will not make a commitment yet, because we have other things we have to work on, too.
DR. PHILLIPS: I’m sorry, this is just coming back to me, but Dr. Capps and the American College of Cardiology both had a point that we didn’t capture, and that is longitudinality and linkages. So one was about linkages to mortality and other outcome data that’s an important identification need, and for ACC, it was about longitudinal linkages so that we can track outcomes over time. So these are important purposes of re-identifying or linking data that we don’t want to mess up.
MS. KLOSS: What is your feeling now about whether we will have enough to start framing a letter. I certainly think we have enough to describe the current state and the issues and probably at least first — there were a number of very practical recommendations that were presented to us that could be shaped. I mean, I think there’s probably a second and third stage, but I don’t know that we need to wait until —
DR. MAYS: I guess for me it was the issue of risk and whether or not we need to hear more about that, because that seems to undergird everything. So I don’t know. I guess I feel like it’s a little premature without a little more about the risk issues.
DR. RIPPEN: Or at least framing what’s included or not so that you can. So what I don’t know is whether or not — I don’t think we would be ready by the meeting.
MS. KLOSS: I think the way our work plan calls for consideration of the letter by the full committee in September — a lot of committees have ended up going to the next one. So that could be November.
MS. BERNSTEIN: Do you want to catalog the specific things that you heard that were recommendations just to have them all in one place? Can we do that? Do you want to —
MS. KLOSS: I think we can do that offline, and I’m really interested in using the remaining time to get everybody’s head into the minimum necessary space, because we really have to pull that — just finalize pulling it together. We think we still have some holes. So it would be helpful to get everybody’s thinking. We were going to kind of do that on our call on Friday, but it was so late. It was set up too late to have good attendance, and I understand it was a long shot.
MS. HINES: One question I have is do you want to try, because almost everyone is here, to set up a call for next week, or do you want to have a doodle go out for that since almost everyone is here? For minimum necessary. It’s just something to think about.
MS. KLOSS: I would rather talk about the content. I mean, do you have anything that — I would like to make sure everybody knows what’s there now.
MS. SEEGER: So for our tentative agenda, we have opening remarks beginning at 8:15, with Linda, and then at 8:30 an overview of framing of the issues with Mark Rothstein who is confirmed.
MS. KLOSS: And your discussion with him, could you share that a little bit?
MS. SEEGER: We have had a number of discussions with folks, and to be honest, many of the people who we approached, their reaction was I’m not really sure I have anything to say on this topic, and once we really start talking about it, they realize that this is really an area that might be very misunderstood. It certainly is an area that OCR has seen many complaints in, and there is a real potential here for there to be — at the very minimum, OCR owes guidance on this topic. HHS was required under HITECH to develop guidance on minimum necessary by 2010. So we are late. And we have been noted to be late by Senator Franken. So it’s good to move on.
So Mark is very happy to give an overview and framing of this, and we look forward to his perspective.
MS. BERNSTEIN: He does have a particular perspective on this stuff. He has strong feelings about this stuff.
MS. SEEGER: As do the folks on our next panel. The first panel is going to focus on policy interpretations and Bob Gellman will lead us off and then another legal perspective will be provided by Adam Greene, formerly of OCR, who will also bring that lens. So that will be helpful. Then followed by Marilyn Zigmund Luke from AHIP. So we will have the health plan perspective.
Then we will break and move into panel 2, which are practical implementation of the minimum necessary standards and approaches for compliance, and as you may recall Linda was instrumental in helping us have AHIP survey their members. Excuse me, AHIMA. And AHIMA’s president, Melissa Martin, will provide an overview of that survey and their findings. She can also provide the perspective in her role as Chief Privacy Officer for West Virginia University Hospitals.
And then we have a pending from Brandon Neiswender from CRISP. This is the HIE in Maryland, the Chesapeake Regional Information System for Patients.
MS. KLOSS: We feel we want one more panelist on this panel too, and we had invited the American Hospital Association and they were not able to provide testimony, nor was MGMA. So we are in the process of reaching out to Healthcare Compliance Association.
MS. SEEGER: Yes, Bryce Mill(?) at HCCA.
MS. KLOSS: Are there any other suggestions?
MS. BERNSTEIN: For provider – health industry kind of type?
MS. KLOSS: Stewards.
PARTICIPANT: Primary Care Association?
MS. KLOSS: We could try.
DR. MAYS: We did have APA on here.
MS. BERNSTEIN: You want a general practitioner sort of a group.
MS. SEEGER: I think we are looking for practitioner that can speak to compliance challenges.
DR. STEAD: Do you have somebody that can bring in the mental health issue?
PARTICIPANT: Next panel.
MS. SEEGER: We can walk through the next panel and then come back. So then we are going to break for lunch. Panel three is a discussion of challenges and opportunities. So we have Alan Nessman from the American Psychological Association practice office. We have Darren Dworkin, who is the senior VP and CIO of Cedars-Sinai who I think we heard loud and clear today how security is really tied into many of the privacy requirements for compliance. So Darren will speak to user controls, audits, and other solutions that they have implemented.
Next we have MGMA on this agenda. It needs to come off, because we are not able to have Rob join us. Then we have Rita Bowen, who is going to be representing the American Health Information Outsourcing Society. She is a former past president of AHIMA, and she is with HealthPort. Oh, she’s with MRO.
MS. KLOSS: Bill, any other suggestions?
DR. STEAD: That’s it.
MS. KLOSS: I think what this illustrates is how poorly understood these concepts are, and I was fascinated in our testimony yesterday that we had reference to data minimization, and there’s some overlap here, and I think bringing clarity to this topic and framing it will be a really important outcome of this hearing. When you are on the provider firing line, minimum necessary is thought of as something that I can use to push back when somebody asks me for 400 complete records to be copied and shipped. You know, and there’s just kind of nothing that you can do to say wait a minute. Do you really need all of this? There isn’t any practical guidance.
But then there is the other part of it which is — as a disclosure management, it could be of use, but then the broader use of it, while we talk about it from the FIPs perspective means don’t collect it in the first place if you don’t need it. So I think we need to go back here and understand — I would like to kind of — and maybe you can help us somewhat. We have these concepts. How have they evolved, and how are they now considered, and how are they considered? How are they being framed in the HIPAA law? What was the guidance supposed to be about? Was it this initial only collect what you need perspective, or only disclose what you need, or both?
MS. BERNSTEIN: So collect what you need is a hard thing in the medical community, because you don’t know necessarily what you need in advance. You’re doing an investigation, and it’s sort of similar to law enforcement. You don’t necessarily know the conclusion you’re going to come to when you’re starting to gather evidence. So I think we don’t sort of limit doctors in what they can collect. They don’t know what might be relevant. So they collect whatever they can that they think might eventually be relevant and try to put it together and make a story that turns into a diagnosis on this particular patient.
But I think I have always assumed therefore that minimum necessary had to do with disclosure. But disclosure means that on the other end there’s somebody who is the recipient, and that person is collecting. Maybe not creating, in the sense that a doctor is producing information when they talk to a person and make a history, but when information is disclosed, is the recipient on the other end collecting data in the minimum necessary way? So there’s kind of a tradeoff.
MS. SANCHES: I don’t have my rule in front of me, but in general, we think of minimum necessary as a way of thinking about and managing how information is shared internally and externally. So looking at who in your organization really needs access to information and then using particular safeguards to prevent people who don’t need access to certain information from getting it. And then also looking at that in terms of disclosures to other people. So you would never give the entire medical record unless they have actually asked for it and there’s a reason for it.
Minimum necessary doesn’t apply to treatment disclosures at all, but it does apply to entities in terms of their internal uses of data. So really it is a role-based access concept.
MS. KLOSS: Well, then I think we will have the picture of what the current world is, at least in hospitals. We don’t have the ambulatory perspective as strongly. We had hoped to get that from MGMA.
DR. RIPPEN: You also don’t have an EHR vendor, because it goes to what is out of the box or not, right, with regards to the technologies to make it easier or difficult.
MS. BERNSTEIN: The reason I went to look at this subcommittee’s letters and hearings is because back then, like 10 years ago now, when we talked about particularly sensitive information, it was similar to the idea of minimum necessary, right? It’s similar to the idea of only using that which is particularly necessary to the function that you have. There were a series of hearings there, so I wonder, if we go back and look at who spoke then, whether they would be willing to update or expand on what they talked about 10 years ago. So it might be another source.
MS. KLOSS: I think knowing what the release of information world looks like, I think it would be a very eye-opening hearing. So we will go back and consider the EHR vendor perspective and I think that’s an excellent idea, and we will go back and look at where we have touched on this and how it has worked.
DR. RIPPEN: The other thing that might be interesting is there was a lot of work done on the continuity of care documents sort of thing as far as what people thought was minimum necessary for transitions in care, and I think it was a pretty significant amount of work, because for physicians to agree across the board is pretty incredible. So again, that might provide at least a nuanced kind of insight into that for clinical care, because people don’t usually want everything from a clinical perspective, unless there is an issue that they are investigating.
MS. BERNSTEIN: Years ago, when we were doing this, we heard from the emergency people saying I need these five things. I need to know allergies. I need to know current diagnoses. I can’t remember what the five things were. But we also had this very animated discussion between the doctors and the lawyers. The lawyers were saying, you know, people have rights and they can withhold information and they are allowed legally to make stupid choices that will cause them bad outcomes. And the doctors were on the whole, and I am painting with a very broad brush, saying we want to fix people and we need information to do good and to help people, and we don’t want to be prevented from having any information. So we want all the information available to us.
And there were expressed concerns about liability if you don’t have access to certain information and so forth, which I found surprising, because as a lawyer I’m thinking, well, if I can show that you didn’t have access to that information at that particular time when you were making a decision, you shouldn’t be responsible for that information, and the doctors were saying I want all the information to make the best decision, because otherwise I’m going to be liable and I’m thinking, there’s no way you’re going to know every piece of data in the record. That seems more scary to me in terms of liability. But this went back and forth quite a bit.
I think we may reanimate that discussion to a certain extent with this hearing.
MR. COUSSOULE: Part of what probably needs to be covered is the desire to minimize the exposure without using it as an excuse not to share, because that happens in a number of different ways. So I think it would be useful to explore what the motivations are and then how to get to the actual value proposition in there as opposed to what may be a motivation that doesn’t get there.
MS. BERNSTEIN: I just want to make sure that we capture whatever next steps you want us to do before the next phone call or the next hearing.
MS. KLOSS: I think we should have a phone call in preparation for the meeting.
MS. BERNSTEIN: Next week or as soon as we can get one?
DR. RIPPEN: Do you just want to send an email with your ideas?
MS. KLOSS: Well, what’s everybody thinking? Do we want to just take a break for a week and do it the week after? Next week? Phone call next week. All right, we will try to take up next steps on both of these hearings. I think we can do that.
MS. HINES: So you want one call to do both.
MS. KLOSS: I think it’s time the subcommittee come back together again. It was helpful to have us working in small groups, because we were able to fast track two things, but now I think we need to work as a subcommittee.
MS. HINES: Sixty or 90 minutes for that?
MS. KLOSS: Let’s do 60.
PARTICIPANT: Amazing work. So thank you to whoever helped with pulling everybody together.
MS. BERNSTEIN: Rachel is responsible for putting this hearing together. Let’s also thank staff.
(Whereupon at 4:05 p.m., the meeting was adjourned.)