Department of Health and Human Services


Subcommittee on Privacy, Confidentiality & Security

“De-Identification and the Health Insurance Portability and Accountability Act (HIPAA)”

May 24-25, 2016

Hubert Humphrey Building
200 Independence Avenue, SW – Room 705-A
Washington, DC 20024

The National Committee on Vital and Health Statistics Subcommittee on Privacy, Confidentiality & Security convened a hearing on May 24-25, 2016.  The meeting was open to the public and was broadcast live on the internet.  A link to the live broadcast is available on the NCVHS homepage. 

Committee Members Present

  • Nicholas L. Coussoule
  • Barbara J. Evans, Ph.D., J.D., LL.M.
  • Linda L. Kloss, M.A., Chairman
  • Vickie M. Mays, Ph.D., M.S.P.H.
  • Sallie Milam, J.D., CIPP, CIPP/G
  • Robert L. Phillips, Jr., M.D., MSPH
  • Helga E. Rippen, M.D., Ph.D.
  • Walter G. Suarez, M.D., M.P.H.

Staff Members

  • Maya Bernstein
  • Rebecca Hines, MHS Executive Secretary
  • Rachel Seeger, OCR
  • James Scanlon, HHS Executive Staff Director
  • Linda Sanches

Hearing Presenters List

  • Micah Altman, PhD
  • Daniel Barth-Jones, MPH, PhD
  • Cavan Capps, CISSP
  • Sheila Colclasure, MA
  • Jeptha Curtis, MD
  • Michelle De Mooy
  • Yaniv Erlich, PhD
  • Kim Gray, JD
  • Bradley Malin, PhD
  • Jacki Monson, JD
  • Jules Polonetsky, JD
  • Ashley Predith, PhD
  • Ira Rubinstein, JD
  • Vitaly Shmatikov, PhD
  • Cora Tung Han, JD


Tuesday, May 24, 2016


Prepare a summary report for the Full Committee report.

Deliver a letter to HHS Secretary with policy recommendations. 


OVERVIEW AND FRAMING OF CURRENT ISSUES – Dr. Simson Garfinkel, Information Access Division, National Institute of Standards and Technology

HIPAA sets forth methodologies for de-identifying protected health information (PHI). Once PHI is de-identified, it is no longer subject to the HIPAA rules and can be used for any purpose. The U.S. Department of Health and Human Services (HHS) Office for Civil Rights (OCR) issued guidance in 2012, specifying two ways through which a covered entity can determine that health information is de-identified: (1) the Expert Determination Method and (2) the Safe Harbor Method. Much has changed in the health care landscape since that time, including greater availability and use of big data. Concerns have been raised about the sufficiency of the HIPAA de-identification methodologies, the lack of oversight for unauthorized re-identification of de-identified data, and the absence of public transparency about the uses of de-identified data. The purpose of this hearing is to gather industry input on existing guidance and possible limitations of the de-identification methodologies for making recommendations to the Secretary of HHS.

NOTE:  For further information about presentations, please refer to transcripts and PowerPoint presentations.


Dr. Garfinkel began with an overview of the de-identification of personal information and the Health Insurance Portability and Accountability Act.  There is an ever-increasing interest in the practice of de-identification as many health care providers wish to share patient data for research purposes.  Doing so drives the need to protect patient privacy.  According to the current HIPAA privacy rule, properly de-identified health information can be distributed without restriction.  Uses of de-identified data extend beyond healthcare and can be found in the banking and advertising industries as well.

However, there are challenges, because de-identified datasets can be re-identified.  Data identifiability depends, in part, on the difficulty of linking data sets with other data.  The complexity of this topic was illustrated with highlights of techniques such as generalization, field swapping, and adding noise.  The Safe Harbor Rule was mentioned as a tool to prevent re-identification, though it risks degrading the quality of the data.  Other ways to measure the effectiveness of de-identification are the k-anonymity framework and Tiger Team use.  Furthermore, differential privacy is a framework developed to support query systems.  On the other hand, this framework affects the quality of data.  Several approaches to minimize this impact involve using synthetic or artificial data sets that work on a wide range of data including video.  With the release of widely used data, more de-identification techniques that afford measurable privacy guarantees are needed.
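The relationship between generalization and k-anonymity mentioned above can be illustrated with a minimal sketch.  The records, the choice of age and ZIP code as quasi-identifiers, and the function names are all hypothetical and chosen only for illustration; they are not taken from Dr. Garfinkel's presentation.

```python
from collections import Counter

# Hypothetical toy records: (age, ZIP code, diagnosis). Not real data.
RECORDS = [
    (34, "20024", "asthma"),
    (36, "20024", "diabetes"),
    (38, "20025", "asthma"),
    (52, "20110", "flu"),
    (57, "20112", "asthma"),
    (59, "20119", "diabetes"),
]


def generalize(record):
    """Coarsen the quasi-identifiers: 10-year age band and 3-digit ZIP prefix."""
    age, zip_code, diagnosis = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**", diagnosis)


def k_anonymity(records):
    """Return k: the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter((age, zip_code) for age, zip_code, _ in records)
    return min(groups.values())


k_raw = k_anonymity(RECORDS)
k_generalized = k_anonymity([generalize(r) for r in RECORDS])
print("k before generalization:", k_raw)
print("k after generalization:", k_generalized)
```

In this toy data every raw record is unique on age and ZIP code, while the generalized records fall into two groups of three.  The cost is the loss of detail the paragraph above describes: exact ages and full ZIP codes are no longer available for analysis.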


Discussion ensued regarding the importance of making data available on the web versus the issues presented.  Efforts to standardize curricula for de-identification practitioners and to provide online training are ongoing. What is needed are people trained in this effort.  To date, methodological approaches to de-identification exceed practice in the field.  How can practice be accelerated and what would be the policy drivers to apply science to practice?  No research has been conducted pertaining to using incentives to drive making application of the science available.  However, NIST is working on a number of projects that can provide guidance in this area. 

More research is needed regarding the possibility for HHS or others to offer technical tool kits to set up data query systems that are common to all states.  A question was raised about customizing tools for states that use MONAHRQ, to which Dr. Garfinkel replied that more research is needed as there might be data aggregation issues.  NIST is engaged in privacy research to encourage more participation in the data economy.  A de-identification pilot study is being conducted this summer to evaluate de-identification software.

PANEL 1  –  Policy Interpretations of HIPAA’s De-identification Guidance

  • Ira Rubinstein, JD, Senior Fellow, Information Law Institute; Adjunct Professor, New York University School of Law
  • Bradley Malin, PhD, Vice Chair for Research, Department of Biomedical Informatics, School of Medicine; Director, Health Information Privacy Laboratory; Vanderbilt University
  • Daniel Barth-Jones, MPH, PhD; Assistant Professor of Clinical Epidemiology; Mailman School of Public Health; Columbia University

Dr. Malin

Several conflicts of interest were disclosed.  Investigating the de-identification process is important.  However, the manner in which it is currently performed may be safe.  A review was given of the study initiated by HHS in 2011, which examined all known actual re-identification attacks and found an average rate of identification of 0.013 percent.  The most pressing matter concerning de-identification in HIPAA is the Safe Harbor model.  The unlimited features are a compounded problem and because nothing is explicitly identified, there is no way to stop this information from being disclosed.  The second concern is a natural language statement, which evolves from lengthy conversations that can be added as context in a patient electronic medical record.  Further examples of de-identification adversarial threats, security breaches, and protection models were covered.

A segment of the presentation was dedicated to explaining which risk measurements should be used with regard to current HIPAA de-identification guides.  Risk means something different to everybody.  Therefore, risk measurements are still an open question.  Using time-limited determination to reduce the exposure of existing data in the market was suggested.  However, the appropriate amount of time is unclear.  Also mentioned were models that suppressed data.  It was further noted that IRB approval using such models is unlikely, as they have no guidance as to whether or not accepting proposals using the shift and truncate model is permissible.  What recommendations would you make to help keep policy at pace with or ahead of technology?  It would be to have a clearinghouse for best practices in de-identification and come to an agreement for setting “small” risk.  Additionally, if oversight were to become a government initiative, there is a cost associated with this time consuming undertaking. 

Dr. Rubinstein

This review was based on his recent article with Woody Hartzog, in which they reviewed the relevant legal and technical literature on de-identification and anonymization.  Incidents that demonstrate the failure of anonymization, where re-identification can be performed, have prompted disagreement among experts from various disciplines.  Emphasis should always be on minimizing risk instead of only preventing harm.  There are two opposing views, and representatives of both sides of the argument in the literature underscore the disagreements.  Dr. Rubinstein mentioned that there is a high level of scientific discord between the two groups.  The conclusion of this study shifted the debate by reframing the argument to focus on data release policy to minimize risk.  Policy makers encounter challenges in determining if improvement or radical change is needed.  Other agencies should look into the relevancy of other methods that help make sound decisions about data release.  While methods of releasing data to the public are more risky, the scientific community is not as large and there should be some compromise.

Other topics covered included looking at the de-identification process in a broader spectrum relating to disclosure limitations, techniques and related tools.  Relevancy in dissemination-based access and query-based access, collaboration across the divides, and expanding the focus of privacy rule to include other methods and techniques were detailed.

Dr. Barth-Jones

One conflict of interest was disclosed.  This review began with the presenter describing the essential role of accurate data in health systems research and the importance of using modern statistical disclosure risk assessments.  Dr. Barth-Jones presented examples of misconceptions about the de-identification process before and after HIPAA compliance.  There is no guarantee that de-identified data will remain de-identified.  The presentation detailed the point that there is a trade-off between information quality and disclosure protections.

Moreover, some de-identification methods can statistically degrade the accuracy of data. Good public policy demands reliable scientific evidence.  Other examples in which re-identification science has failed to support public policy thoroughly were provided, with an understanding that there are some unavoidable tradeoffs between data privacy and statistical accuracy and its implications regarding harm.  However, some suggestions to better inform public policy and practice included:  using statistical random samples and scientific designs to provide representative risk estimates; verifying re-identifications; using ethical study designs regarding re-identification; investigating multiple realistic threats; having additional process controls; and implementing appropriate data security and privacy policies, procedures and associated administrative, technical and physical safeguards.


A suggestion was made to establish a definition and guidance for re-identification that involves linking to the 16 identifiers.  A question exists concerning what is an acceptable level of risk.  However, decisions for thresholds with respect to potential identification were made by public health officials many years before HIPAA.  That precedent may be a starting point to create a standard. Discussion continued around the topic of wrongful re-identification and the relevance of the harm associated with it.  Views reiterated the danger of prohibiting re-identification.  People who are re-identified are good sources to indicate re-identification behavior.  From a legal perspective, the courts are examining this under the Fair Credit Reporting Act with respect to inaccurate information.

There was agreement regarding what is acceptable risk.  Zero risk is not a feasible solution but rather changes in regulatory structure.  An example was given depicting how public health reporting aided in the discovery of the Flint, Michigan water crisis to emphasize the need for data release.  One panelist noted that we have to be able to at least count what is going on as it relates to health and causes of diseases.  There is a balance required in designing a system that protects privacy and yields information critical to population health efforts.  Research value cannot be obtained if data is not made available.  Within the health domain, determining the harms is challenging as different data also has different risks and outcomes.  Addressing barriers to access and using innovative approaches to data release is valuable.

A question was raised about obtaining certification before linking to other data and how that impacts the researcher.  De-identified data that is linked to another dataset using quasi-identifiers creates a new data set with re-identification risks.  A recommendation to support researchers was to increase the number of de-identification experts.  Further discussion focused on the nature of data, such as genetic information and social media data, and its increased risk for re-identification.  Genetic information poses concerns regarding the inferential disclosure component as well as its uniqueness.  Therefore, genetic information should not have passage through Safe Harbor.  An additional concern is misidentification due to disclosure of hereditary information.  As it relates to social media, there are individuals who self-disclose and then there are those who disclose the health information of others.  This raises concerns about consent.
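The linkage risk raised in this discussion can be shown with a small sketch.  Both datasets below are invented for illustration, as are the field names and the choice of birth year, sex, and 3-digit ZIP prefix as quasi-identifiers; the example assumes an outside party holds a named list that shares those fields with a de-identified extract.

```python
# Hypothetical "de-identified" clinical extract: names removed, quasi-identifiers kept.
clinical = [
    {"birth_year": 1975, "sex": "F", "zip3": "200", "diagnosis": "asthma"},
    {"birth_year": 1975, "sex": "M", "zip3": "200", "diagnosis": "diabetes"},
    {"birth_year": 1982, "sex": "F", "zip3": "201", "diagnosis": "flu"},
    {"birth_year": 1982, "sex": "F", "zip3": "201", "diagnosis": "asthma"},
]

# Hypothetical public list that carries names alongside the same quasi-identifiers.
public = [
    {"name": "Person A", "birth_year": 1975, "sex": "F", "zip3": "200"},
    {"name": "Person B", "birth_year": 1982, "sex": "F", "zip3": "201"},
]

QUASI_IDENTIFIERS = ("birth_year", "sex", "zip3")


def link(clinical_rows, public_rows):
    """Map each name to the clinical rows that share its quasi-identifier values."""
    index = {}
    for row in clinical_rows:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(row)
    return {
        person["name"]: index.get(tuple(person[q] for q in QUASI_IDENTIFIERS), [])
        for person in public_rows
    }


matches = link(clinical, public)
# A name that matches exactly one clinical row is a candidate re-identification.
unique = {name: rows[0]["diagnosis"] for name, rows in matches.items() if len(rows) == 1}
print(unique)
```

Here "Person A" matches a single clinical row, while "Person B" matches two rows and remains ambiguous.  Even a unique match is only a candidate: as the panelists noted, a claimed re-identification may be wrong and should be verified.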

In 2014 NIH issued genomic data sharing policy recommendations.  NIH requires everyone to get broad consent to have information deposited into the genotypes and phenotypes database.  Advantages and disadvantages of this policy were deliberated.  Other topics discussed included: Statistical disclosure limitation techniques that have been focused on aggregate statistics and how the computer science community is following the trend; considering privacy as a system problem and taking system type approaches to finding solutions; and recommendations to have more HHS sponsored research in order to study the risks of Safe Harbor. 

Has there been an assessment of the gap in skills and competencies in our health system?  Investigating this would mean taking into account the diverse nature of health care and all of the data generated.  Statistical experts such as biostatisticians, informaticians and computer scientists are not trained in it.  Additionally, the gap remains unknown as organizations use other tools for obtaining semi-identified limited datasets.

Re-identification is expensive and difficult.  There is a time issue around the concept of de-identification.  Therefore, identifying policy drivers to improve the degree to which data is protected from re-identification is necessary.  NIH sponsored a center of excellence in ethics research that will be ongoing for the next four years.  One approach to controlling risk in the system involves using a semi-credentialing process.  Other suggestions included: creating a mechanism for revisiting these issues on a more regular basis; and participating in cross-disciplinary collaboration to gain a better understanding of the varying views.

De-identification of survey data; commercial data entrepreneurs, data warehouses, Google, and other sources are all examples of data moving outside of the traditional realm into a market-based environment.  Concerns emerge if there are no limitations to what you can do with data and it turns into data exploitation.  It is true that there are data companies outside of the scope of HIPAA.  A member asked a hypothetical question:  If you were starting a company that collects large amounts of data, what would the elements of a new program to make the data available to the public look like?  One panelist responded that in the design phase, bringing in de-identification and re-identification experts to gain an understanding of both sides of the argument would be key.  Another panelist mentioned that it has not yet been resolved how to provide an open data environment for research purposes and respect privacy at the same time.   One other panelist mentioned that they are currently working on a pilot project to figure out how to do ingressing of data in a privacy-preserving manner.

PANEL 2  –  De-Identification Challenges

  • Michelle De Mooy, Deputy Director, Privacy and Data Project, Center for Democracy and Technology; Washington, D.C.
  • Jules Polonetsky, JD; CEO, Future of Privacy Forum; Washington, D.C.
  • Ashley Predith, PhD; Executive Director, President’s Council of Advisors on Science and Technology; White House Office of Science and Technology Policy; Washington, D.C.
  • Cora Tung Han, JD; Federal Trade Commission, Bureau of Consumer Protection; Washington, D.C.

Ms. De Mooy: 

CDT is a non-partisan non-profit technology policy advocacy organization dedicated to protecting civil liberties and human rights on the internet, including privacy, free speech and access to information.  Big data in health care, by definition refers to electronic health datasets so large and complex that they are difficult to manage with traditional hardware and software.  Big data concerns are inclusive of collecting various types of data, the speed of managing the data and providing the ability to share information.  The healthcare industry’s use of Information Technology is ever changing with the shift to electronic health records and data collection.  Researchers in healthcare as well as non-traditional entrants are making use of big data to improve and disseminate information about effective and preventive treatment strategies.  However, the advancement in big data raises privacy issues. 

The Fair Information Practice Principles were interpreted to require entities to disclose the purpose for using the data and obtain new consent for secondary uses.  Existing practices are insufficient to handle the increasing data flow.  Genomic data has its own challenges in privacy security practices.  An overview was provided of identified DNA databases and the impact in: the market place; new risks to identification; policy barriers; research; innovative application; and the privacy protecting measures for de-identified data.

Recommendations included moving to FTC’s objective standard; developing one vehicle for researchers to use limited data sets without obtaining prior authorization; re-conceptualizing research uses; supporting techniques such as research in effective anonymization and generalizations; and issuing a risk assessment tool.

Jules Polonetsky: 

The Future of Privacy Forum is a think-tank of privacy ideas.  The organization realizes how data and technology are changing the world; at the same time understanding the need to minimize risks.  Topics addressing both sides of the argument included written or interpretation of policy, attacks on pseudonymous data sets, and unique uncontrollable identifiers that are available to anybody in market places.  Attention was given to risk controls with set values. 

Suggestions were made to establish a spectrum of data levels with applicable rules at each stage.  Additionally, there is an urgency to address the importance of making ethical decisions, particularly for the sectors that are not covered by common rules like IRBs.

Ashley Predith

In January 2014, President Obama requested a study be conducted looking at how big data challenges are being confronted by the public and private sector.  The outcome was The White House and PCAST Reports.  PCAST focused specifically on the technology aspect.  Data is collected on a continuous basis with or without our knowledge through various digital sources such as computer “cookies,” mobile apps, etc.  A technological intervention that may benefit the healthcare industry is the creation of protective mechanisms in databases that support patient treatment.  Another benefit would be developing apps for monitoring geriatric patients.

Cryptography for cybersecurity, anonymization and de-identification, data deletion and ephemerality, and notice and consent were detailed as the four technological strategies reviewed in the study.  From this assessment, additional technology such as privacy and profiles, context and use question, and enforcement and deterrence using auditing technology, were identified to complement the previous strategies.  Other recommendations included encouraging education and training opportunities concerning privacy protection.  The report concluded that focusing on the use of the data is the most technically feasible place to apply regulation.

Cora Tung Han:

FTC is a civil law enforcement agency with a consumer protection and competition mandate.  The focus of the Division of Privacy and Identity Protection is to promote data security and privacy protection through enforcement work, policy, and consumer and business education.  Dr. Han shared highlights of the May 2014 Data Brokers Report entitled, A Call for Transparency and Accountability.  In addition to digital sources, consumer information is collected from magazine subscriptions, store purchases, and publicly available outlets.

Risk mitigation products, consumer benefits, and security risks were underscored.  The presentation also discussed the data segmentation; consumer file creation; and data storage and marketing.  The report concluded that the data brokers industry is complex by virtue of how they store billions of data elements, conduct analysis to make inferences about the consumer, and manage online marketing practices.  Recommendations regarding the creation of legislation that provides consumers with access to information, and implementation of best practices were made.


Discussion began with a question posed to the panel regarding consumers owning the data through its full cycle.  While it is important for individuals to have as much control as possible, data repositories that are privacy protected should provide the ability in the health industry.  Others agreed that people will contribute their data to research if they have an understanding of its use.  An opposing view noted that using de-identified data with the ability to revoke consent poses risks.  There was agreement that in the commercial context, it is unlikely.

A question was raised regarding perspectives on genomic data.  Technical and policy measures are needed to address restricting information and understanding the impact of genomic information becoming part of the medical health record.  Concerns about the absence of ethical oversight were mentioned.  The OCR plans to issue guidance on the status of genomic data. 

Discussion continued pertaining to notice and consent.  There is an ongoing effort to develop ways to communicate with consumers, provide notice and obtain consent.  A stronger process to facilitate getting consent should be required.  One suggestion was offered to create a process that uses technology in the decision making process around privacy rights.  Additional comments were based on building privacy into the life cycle of the management process, developing business ethics policies around use and addressing social determinants.  Ethical constructs, principles and education components were discussed at length. 

Framing of Issues by Subcommittee Members

Discussion began with addressing the overall question of what is it that the committee should focus on.  A question was raised about how intent relates to re-identification.  A stronger emphasis on the intent side will help to answer the question: When you have de-identified data, is the intent to use for a health intervention or is the plan to re-identify?  Special attention should be given to: providing advisement for policy around protecting data, providing guidance and tools for the entities that disclose the data, keeping up with technology, making use of the robustness of de-identification in safe harbor.

A comment was made regarding who is responsible for the data.  The covered entities are limited in their data protection efforts, yet the public is unaware that HIPAA fully protects their data.  Lacking in the definition of risk, establishing appropriate levels of risk, and determining what an acceptable risk should be is a priority.  The committee can review what SDL techniques can be applied in addition to Safe Harbor as well as research other models.  Clarification is needed regarding enforcement and covered entities that follow rules and their accountability for a breach.  A balance is necessary between who has the data versus who breaches the data.  Once you leave the HIPAA realm, there are different sets of rules for the same data.

Focus should address what is the targeted use and intent.  Strengthen the de-identification process environment for the originator and the end user.  Provide advice as it relates to differentiating data; research and commercial.  Take a data stewardship approach and help both the consumer and the covered entities to understand the risks involved in bundling data.  Review best practices and keep up with the technology. Acknowledge the risks of mis-identification as well.  Research collecting information across the government and the benefits of integrating sensitive information to effectively address challenges in population health. 

A suggestion was made to use caution when considering a bill to regulate penalties.  However, clarification was given about the subject noting that the recommended bill was to support contractual obligations in order to prevent people from re-identification, and not to regulate with penalties in a civil or criminal manner.  One recommendation included forming a committee to determine what tools are available that the government can use, which was supported by a suggestion to consider HITRUST, an organization that just released a de-identification framework inclusive of an evaluation component.

Public Comment Period

There were no comments


The meeting was adjourned at 4:55 PM.

Wednesday, May 25, 2016

Committee Members Present

  • Nicholas L. Coussoule
  • Barbara J. Evans, Ph.D., J.D., LL.M.
  • Linda L. Kloss, M.A., Chairman
  • Vickie M. Mays, Ph.D., M.S.P.H.
  • Sallie Milam, J.D., CIPP, CIPP/G
  • Robert L. Phillips, Jr., M.D., MSPH
  • Helga E. Rippen, M.D., Ph.D.
  • Walter G. Suarez, M.D., M.P.H., (Phone)

Staff Members

  • Maya Bernstein
  • Rebecca Hines, MHS Executive Secretary
  • Rachel Seeger, OCR
  • James Scanlon, HHS Executive Staff Director
  • Linda Sanches


PANEL 3  –  Approaches for De-Identifying and Identifying Data

  • Vitaly Shmatikov, PhD; Professor of Computer Science; Cornell; New York, NY
  • Jacki Monson, JD; Chief Privacy Officer; Sutter Health; Sacramento, CA
  • Jeptha Curtis, MD, FACC; American College of Cardiology
  • Cavan Capps, CISSP; Big Data Lead, US Department of Census, US Department of Commerce, Washington, DC
  • Yaniv Erlich, PhD; Assistant Professor of Computer Science, Columbia University, Member, New York Genome Center; New York, NY (Note- The agenda was adjusted and Dr. Erlich joined Panel 3)

Yaniv Erlich:

Dr. Erlich’s presentation began with an explanation of our new era of genomic information driven by initiatives that influence the need to share datasets.  Additionally, companies such as AncestryDNA and others, offer services to genotype your entire genome.  After several weeks, you can download a text file from their website.  The genetic genealogy community built their own tools for their website and created large databases.  There are websites to crowd-source genealogy information where people can upload their entire family tree and also collaborate.  There is a vast amount of information available online.  Dr. Erlich provided examples of how his group, using these unrestricted websites, was able to download the entire information from a website and obtain 18 million records.

However, de-identification problems will grow as people add more identifiers to the data.  He concluded that there is some confusion in the community about the status of DNA information in Safe Harbor, as well as knowing where DNA fits into the NIH issued policy.  Therefore, instead of focusing on protecting the privacy of individuals, perhaps the focus should be on building trust relationships with the community. 

Vitaly Shmatikov:

An overview was presented with an emphasis on factors affecting de-identifications.  The vast sources of de-identified public information make it easy to link with other sources and learn about individuals.  It is believed that de-identification occurs by removing certain attributes.  However, any combination of attributes can be identifying and existing de-identification techniques do not take this into account.  De-identification separates data from users of data thereby destroying its utility.  The process of using data in a privacy preserving way is a holistic approach.  Current techniques also fail to support machine learning.  Although there is progress in machine learning with the development of new platforms, nothing is in place to address privacy. Technological limitations are another factor impacting de-identification.   With a focus on harm to individual users, there is a need for legal and regulatory measures.  To date, research communities and machine learning communities are developing technologies and secure computing that help with data protection. 

Jacki Monson:

The presentation highlighted transparency for patients, the importance of data for medical research and innovation, the current state of de-identification, and the benefits and concerns for re-identification.  There is no standard approach to transparency.  Therefore, the discretion is left up to the user of the data.  Patients feel more comfortable about having their information shared when the importance of their contribution to medical research is explained.  Mechanisms to obtain authorizations, such as electronic consent or better, need to be available.  Opting-in and opting-out has worked well for HIEs.  The electronic environment and interoperability do not lend themselves to medical research from a precision medicine perspective.

Currently, there are no means to link accurate data electronically within the privacy guidelines.  Challenges particularly surface during de-identification.  Furthermore, to re-identify data is nearly impossible.  Using the expert method for de-identification purposes is not feasible.  The expert approach is expensive.  When the cost/benefit analysis is conducted, it strains the research budgets, thereby preventing organizations from proceeding with studies.

The 18 identifiers are subjective.  Therefore, moving towards interoperability should also include flexibility and provide guidance.  In reviewing software that statistically de-identifies data, it was noted that meeting regulators' requirements is important.  The presentation also reported on data control with regard to adding language to agreements, especially with vendors whose authorization is required for use other than the initial purpose.  While their practices are considered stringent in the industry, it is necessary because there is no regulatory guidance.  Furthermore, the organization is unable to audit as a means of ensuring that re-identification is not taking place.  Recommendations were made for addressing regulatory guidance and controls when the re-identification process is used for the purpose of benefiting the patient.

Jeptha Curtis:

The American College of Cardiology is a professional society that participates in a number of activities aimed at lowering costs and improving healthcare and health outcomes.  As an organization that works with the American College of Cardiology, NCDR runs a number of registries that collect personal health information.  An overview was provided about the mission of the organization, the purpose of the tools and mechanisms provided for the healthcare industry to deliver better care, and sources of data.  Examples were provided where the organization successfully linked data for medical research purposes to improve patient care.  Business associate agreements and approaches to de-identification were discussed.  Although ACC and NCDR have a multifaceted approach to minimizing risk, it was noted that there is a need for richer multidimensional data, clear standards for handling PHI, governance, and policies and processes on data security.

Cavan Capps:

Census is working with big data.  The agency conducts 300-400 different surveys, which generate a variety of information.  In the health arena, as with Census, there is a need for data to be relevant.  The benefits from precision medicine come with achieving the level of granularity needed for proper analysis.  Having the ability to link data produces quality outcomes.  While linking data is a critical part of determining quality, there is an opportunity to reduce the cost of medicine.  However, seeing the relationships between people’s genetics, behaviors, and treatments is challenging. 

The privacy and big data collision and a common privacy data architecture were detailed in this presentation.  Mr. Capps described how to create a way of computing where identifiers are not kept with the data.  The advantage of this is that the process protects against identity theft.  Provenance, that is, determining the commitments made to the patient, was emphasized in his overview.  Additional highlights included ensuring that data is operated on a system that allows privacy auditing, and challenges with linking data in various environments.  Three models are currently being reviewed for analyzing data: 1) an enclave to which people might build interfaces for a model; 2) formal privacy techniques used to create a synthetic data set; and 3) secure multi-party computing.
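
As an illustration only, the idea of computing on data whose identifiers are stored elsewhere can be sketched as follows.  This is not the architecture Mr. Capps presented; the field names, the use of a keyed hash (HMAC-SHA256) as the linking pseudonym, and the function name are all assumptions made for the example.  Separating identifiers in this way does not by itself make data de-identified under HIPAA, since the remaining attributes may still be identifying.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names; the hearing summary does not specify any.
IDENTIFIER_FIELDS = {"name", "ssn", "address"}


def split_record(record, key):
    """Split one record into (identifier row, analytic row) linked by a pseudonym."""
    identifiers = {k: v for k, v in record.items() if k in IDENTIFIER_FIELDS}
    attributes = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    # Keyed hash of the identifiers: the same person and key give the same pseudonym,
    # so records can still be linked without storing identifiers with the data.
    material = "|".join(f"{k}={identifiers[k]}" for k in sorted(identifiers))
    pseudonym = hmac.new(key, material.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"pid": pseudonym, **identifiers}, {"pid": pseudonym, **attributes}


# The key is assumed to be held by the data custodian, apart from both tables.
key = secrets.token_bytes(32)
record = {"name": "Jane Doe", "ssn": "000-00-0000", "address": "1 Main St",
          "diagnosis": "I10", "visit_year": 2016}
id_row, data_row = split_record(record, key)
```

In this sketch the analytic row carries only the pseudonym and the non-identifying attributes, while the identifier row would be kept in a separate, more tightly controlled store.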


A question was raised regarding the interdependency of privacy and genetic data.  Using an example from a webpage, one panelist explained how that process is used to reduce the search space and narrow down an individual’s biological family.  Interesting differences between two presentations were expressed.  Data sharing has a host of issues that include: being hard to manage; decreased utility; lacking audit capacity; using contracting mechanisms that are difficult to track; and never knowing who is accessing the data.  On the other hand, the ACC, which uses registries, displayed its partnerships with hospital-specific data elements and detailed how it uses them meticulously.  A follow-up question asked how to create trust relationships in revealing data.  In particular, one panelist mentioned that some registries are mandated and require participation even though good security controls are not in place.  Therefore, guidance around registries, privacy and security would extend trust and transparency with patients. 

In response to the approach of separating the data from the person for de-identification, a panelist commented that the use of pseudonyms creates patient safety issues because data cannot be linked together.  Guidance around use cases would help the industry in its approach to the experts.  Creating mechanisms to submit use cases for evaluation would also be suitable.  Another perspective on de-identification and the separation of data from the person was discussed.  Researchers may use a de-identification process by determining which attributes are identifying.  Others develop techniques to understand the data for precision medicine and other forms of research.  However, privacy and protecting individuals for these two lines of inquiry were detailed as separate processes.  Moreover, advantages of an integrated approach for quality outcomes were emphasized. 

A discussion ensued about transparency as panelists explained their views about the culture of business agreements from a statistical agency perspective and its impact in the industry.  Business agreements are non-transparent.  Although HIPAA allows CDC to get hospital data, hospitals would not share data without a business agreement.  This is a culture that supersedes the government’s right to have access to the data.  Additionally, it is unknown how transparency is affected by provenance outside of that arena. 

While transparency is needed, there is no means of tracking the data even though multiple databases and systems are available.  One panelist noted that in our current state of technology, there is a conscious effort not to have systems interact with one another.  Most systems are proprietary, which benefits EHR companies and therefore presents an interoperability issue.  However, Meaningful Use offers incentives for vendors to make some data interoperable.  Another panelist agreed that tracing data is difficult but said that the forward-thinking approach is to centralize the information in the cloud, where researchers have processing ability.  An opposing view pointed out that this may sound great in theory, but from experience it is manual and cumbersome, and the technology is not yet available.  Transparency should initially be considered with use of data. 

A comment was made suggesting that the committee not view technology and transparency from a health system perspective, but rather make a distinction between the research and development side and the individual patient’s point of view.  There is no guidance with regard to provenance and intent of use.  In order to address this issue, technology has to be designed to evolve and adapt to the frequency of new uses.  To some extent, lack of interoperability is influenced by vendors providing proprietary equipment because of the financial benefit.  The de-identification standard concerning genetics needs an innovative layered approach with transparency.  Also, improving the communication process with patients on a regular basis and informing them about how their data is being used is necessary moving forward. 

Navigating all of the different standards of de-identification is challenging from a provider’s perspective.  Researchers are often frustrated as they engage in the evaluation process, realizing that the data may not be usable after all requirements are removed.  In the case of sensitive diagnoses such as mental health or HIV status, there are regulatory requirements that pose challenges.  However, pursuing evaluation at the federal and state level was suggested.  Similarly, the more privacy laws that are under evaluation, the more challenging the data becomes to use.  One participant noted that in cases where multiple diagnoses occur, or in circumstances where genetic information is included, it is hard to determine the long-term implications in a patient’s life.  This is the result of dealing with information whose value is unknown.  However, protecting against harms, rather than protecting privacy, may be a valuable path forward.

What can be done to make the expert determination method a more viable approach?  A technology solution is most cost effective.  The pool of experts is limited and expensive.  Machine learning, as well as its risks, is poorly understood.  However, transparency would help to show how different attributes are used in algorithms.  A question was raised regarding how to encourage people to participate in genome projects while educating them on the unintended consequences of doing so, especially as it relates to vulnerable populations.  A presenter provided a scenario of what an individual would go through as a participant bringing a genome to such a project, including the opt-in process. 

Furthermore, it is a social responsibility to encourage principles that eliminate discrimination and racism.  Likewise, it is beneficial to share with the patient how participating is advantageous.  There were opposing views regarding privacy harms.  One panelist is in the process of asking for research to define and quantify harms.  A list of 18 identifiers is not sufficient; a process in which identifiers are judged against set criteria is needed.  On the other hand, another participant noted that harm is quantifiable and is an active topic in the research community.  The IRB process was given as an example.  Disagreement was also expressed by suggesting that outlining guiding principles is the best approach because harm is a subjective term.  Yet in a broader sense, it is important to explore the concept of harm, the implications of capturing so much information in an EHR, and what that means in terms of de-identification and re-identification. 

Discussion continued on provenance and trust, and on privacy as a tool to achieve trust.  Digital markers used to control or restrict re-identification are currently in the experimental phase as a manual process.  To date, the tiered approach is only a concept under evaluation.  A recommendation was made to invest in creating a common language and standards for the tiers. 


  • Micah Altman, PhD; Director of Research, MIT Libraries; Head/Scientist, Program on Information Science; Non-Resident Senior Fellow, Brookings Institution; Boston, MA
  • Sheila Colclasure, MA; Privacy Officer, Acxiom; Little Rock, AR
  • Kim Gray, JD; Chief Privacy Officer, Global, IMS Health

Micah Altman:

Comments in this testimony were informed by research with collaborators through the Privacy Tools for Sharing Research Data project at Harvard.  Terms such as privacy, confidentiality, identifiability, and sensitivity have different meanings in different fields.  Therefore, a recommendation was made to anchor those meanings to a specific discipline when issuing regulations or commentary.  Anonymization and challenges to identifiability were presented in the context of risk.  Recommendations were made for examining the entire life cycle of data management, from collection through transformation, retention, access, and post-access.  This research developed catalogs of controls from reviews of the legal, statistical, and cryptographic literature.  Likewise, the presentation explained a recommendation that supports developing a catalog of privacy controls that is flexible and can grow over time. 

Sheila Colclasure: 

Ms. Colclasure gave the background of Acxiom, the first privacy program.  She referred to “dirty data” as a continuing problem because data degrades so rapidly.  Today, most clients aggressively invest in data management because they have interoperability problems.  It takes a technological treatment of the data, administrative controls, and an accountability system to achieve de-identification.  The Ethical Contextual Interrogation, also known as the privacy impact assessment, is their patent-pending program used for privacy implementation.  Ms. Colclasure provided several examples of their process, in which they use four stakeholders (a privacy person, a legal person, a security specialist, and an engineering person) to examine the design, rules and outcome.  Moving into a world where data is used and shared in an observational nature, it is most important to consider governance, accountability, enabling benefit and protecting against harm.

Kim Gray:

Ms. Gray provided an overview of IMS Health, which is one of the leading processors, users, and maintainers of healthcare information worldwide.  HIPAA gives better guidance, allows for flexibility and anticipates changes in technology.  It is a balancing act to protect privacy without hindering the free flow of information.  In an effort to increase the number of de-identification experts, improve the HIPAA standard, and make it more understandable to people, IMS Health entered into a partnership with HITRUST and created the HITRUST de-identification standard.  This framework also evaluates de-identification methodologies by using a mapping process.  Recommendations included addressing governance in de-identification and setting thresholds.


A question pertaining to internationally and nationally recognized privacy controls was raised.  The FISMA controls are referenced in some work.  There is also work in the NIST big data workgroup.  Privacy is now embedded into the HITRUST Common Security Framework.  While there are examples of organizations that exceed the requirement, most are not able to go beyond HIPAA.  This is directly related to the strains put on any research budget because of the financial and human resources needed.  One panelist described their cross-mapping process using NIST and HITRUST standards.  The cross-mapping of layers of controls will be shared upon approval.  HITRUST also harmonizes HIPAA, HITECH, PCI, COBIT, NIST, and FTC. 

How do we create relationships with data collectors so that they deposit their data for use elsewhere?  No standardized languages or descriptions exist.  Therefore, it is necessary to establish expert boards to help develop guidance, create incentives and interoperable systems.  The discussion continued with members and panelists sharing ideas about managing data and intake tools, mechanisms that keep data from being misappropriated, and suggested elements of guidance for use cases.  Panelists were asked about the commercial aspect of data, its sources, and users.  Explanations were given differentiating the various rule constructs. 

A presenter who sources data from the commercial marketplace discussed their process.  PIA is a classification standard implemented over eight years ago.  Stakeholders come together and make a collective judgement regarding any innovation.  An ethical framework is used as a guiding principle.  PIA applies the human value by asking what is legal, what is fair, and what is the right thing to do.  The rigorous process by which all data passes through regulation was depicted.  The presenter emphasized that every piece of data has some sort of regulation, constriction, permission or prohibition on it.  This same construct is now being applied to digital data. 

Conversely, another panelist, whose healthcare-specific data sources vary and are in most cases unknown, described their process.  All data goes through a de-identification process before it is received.  Special teams work with the data.  As a non-covered entity, the organization uses a HIPAA-type model.  Therefore, access to the data is limited.  The data is never allowed to be used for clinical applications.  De-identification is a wonderful tool but not for everything. 

SUBCOMMITTEE DISCUSSION:  Review themes, identify potential recommendations and additional information needs

This time was used to refine the next agenda and consolidate the information from all of the presentations.  This session began with a suggestion to prioritize categories for the committee to address:  privacy data collision; the current state of de-identification including issues with re-identification; provider’s perspective and the cost of the expert approach; developing research; use cases where current de-identification is stalled; life cycle management process; and the incorporation of a more robust process management and process controls.  The committee decided to share notes and segment the important themes under one of the major categories.  Some themes included:  what is harm and what is risk; technology versus fit; translation to practice, addressing critical issues with regard to moving from science to practice; genomics as a different category that is not classified; and acknowledging the various forms of data and determining the implications of that under the 18 identifiers.

Discussion ensued pertaining to research versus commercial use, and the current status of de-identification.  With respect to research using data, a dialogue was held regarding life cycle and data stewardship.  It was agreed that research should be categorized under use case because it falls on the data consumption side.  There is also a need to add the consumer perspective as sub-bullets.  Transparency, provenance and intent, and de-identification requiring a tiered process are topics that fit into multiple categories.  Ideas were shared considering whether the committee recommendations should be for regulatory changes or a request for more guidance. 

Panel 1 Themes

Key points from the agenda were used as a guide.  De-identified data that has been released can be re-identified.  There need to be technical and administrative solutions for de-identification.  To date, no ethically sound study has been conducted regarding the efficacy of removal of the identifiers.  Therefore, there is a need to evaluate the technical aspect of the Safe Harbor standards.  Improve practice and training in de-identification.  Create policy incentives to increase the application of better de-identification “tool kits,” even from a data fidelity perspective.

Address the balance of when to re-identify and what the options are.  Synthetic datasets could be a good method, although there were concerns about usability.  Models associated with usability were identified as: a) the semi-trusted analytical sandbox (the enclave model), b) the synthetic private dataset, and c) synthetic differentiality.  One committee member stated that science exceeds practice.  We need to raise the sophistication of practice as we know it today.  Another member commented that if it is discussed as translation, it will be beneficial for funding sources.  However, further clarification pointed to translation in the provider world, noting how cumbersome the de-identification process is in healthcare organizations.  It was articulated that some cost/benefit analyses show that it is not worth doing the research.  Fears around risk mitigation are currently hindering research.

Expert determination is expensive to use.  There were opposing views as to the term “cumbersome.”  One participant thought that the electronic record system was the “cumbersome” piece, while another saw “cumbersome” as a factor because there are only five experts nationally and it would take months to engage them.  An additional participant noted that “cumbersome” means a difficult process, when in actuality using expert determination is too expensive.  A final comment noted that the alternative to “cumbersome” is collecting 5,000 or more consent signatures for IRB approval, which is not applicable for business purposes. 

The risk of re-identification is real, and federal language needs to be added to the business agreement.  Special difficulties exist in de-identifying unstructured data and narrative data, along with new issues regarding conversations in portals.  Questions were raised about integrating multiple datasets.  An additional item for discussion was the scientific discord regarding de-identification.  There is currently no oversight on de-identified data.  IRBs do not feel comfortable because there is no guidance around privacy.  One presenter provided a set of recommendations on data release policy and minimizing risk.  Supplementary discussion was held regarding: developing economic incentives to limit risk through deterrence; addressing the four tenets of reasonable data security; and the absence of standardization concerning data suppression. 

A discussion ensued on shifting focus from preventing harm to minimizing risks.  Additional notes were shared pertaining to differentiating the risks of data that is accompanied by data user agreements from those of data on the web.  A participant noted that user agreements are employed in all cases.  There was a difference of opinion regarding the risk of re-identification (in the case of statistical de-identification) and the lack of administrative control.  The language in user agreements is most critical. 

Therefore, a recommendation was made to have a discussion on best practices in building a data use agreement.  Other topics included the distinction between use of and access to the data by covered entities versus non-covered entities.  Data from covered entities should be revisited annually.  Address concerns about having different rules for the same data and about the absence of standards for monitoring apps.  The laws are sector based.  If you go into different sectors, the same information has the potential to be treated differently. 

Panel 2 Themes

Individuals want more control and use of their data.  Flexible data sharing protocols with government oversight are vital.  More robust risk assessment tools and processes are needed for application to analytical databases.  Governance and ethics have to be deliberated.  Frame the discussion in terms of statistical risk identification, harm and the value to society.  Lifecycle was highlighted around best practices with privacy by design.  There were questions seeking clarification around users collecting only the data that is needed and disposing of data as it becomes less useful.  Also discussed were contractual limits, solutions, audits, and reputational harm. 

Another topic included potential rules for data linkage that can be applied to HIPAA data.  FTC called on entities to take reasonable steps to de-identify, make a public commitment not to re-identify, and enter into contracts not to re-identify.  The PCAST report calls for greater consumer transparency.  Consumers can expect data to be used for the purposes for which it was collected.  However, when the data collector chooses to use it outside of that specific context, another consent is needed.  Also, opt-in and opt-out should be made available, and policy guidance for consumer notification needs consideration. 

Panel 3 Themes

Discussion ensued around genomic data, emphasizing two points to consider:  1) HITRUST as a recommended framework; and 2) a crosswalk with other models to determine best practices.  Distinctions between real harm and perceived harm are needed.  When data is re-identified (not necessarily used in a malicious manner), the question becomes, who gets punished?  The disseminator or the one who committed the bad act?  Machine learning is connected to this topic as it relates to developing written agreements.  De-identification conflicts with data exchange regulations.  Providers want use cases to help them understand the policies, and clear guidance instead of having to negotiate that tension.  The presenter also mentioned that patients are comfortable with the use of their data. 

Discussions continued around issues previously captured such as the privacy and big data collision, business associate agreements, cost/benefit analysis and de-identification, and genetic information.  Immediate attention is required for providing guidance as genomic data have become part of EHRs.  People should be educated about the implications of that data.  Risks exist: as more information becomes publicly available, the possibilities of re-identification need to be considered.  The educational function of this issue concentrated on the importance of having patient advisory boards for de-identification, providing guidance in decision making, focusing on transparency, and having trusted data sharing relationships with control. 

Panel 4 Themes

The committee revisited managing risk, examining de-identification, auditing, and contracting.  HITRUST was described as a framework that was created as a model for privacy.  Developers of this product synthesized a number of different security regulations.  One committee member mentioned that although there may be some issues, it is a good idea to explore how it works.  Acxiom looks at consumer expectations from a privacy impact assessment approach.  They check to see if it is legal, fair, and just.  During presentations, one panelist explained the process as governance and accountability against a set of social values. 

SUBCOMMITTEE DISCUSSION:  Frame letter to the Secretary, reach consensus on the timeline and next steps

Linda Kloss asked if the subcommittee thought there was enough information to start framing a letter.  There was agreement that a sufficient amount of information had been presented to describe the current state, existing issues, and recommendations.  An opposing view from a member stated that the recommendations were premature.  The work plan calls for consideration of the letter by the full committee in September.  A conference call was recommended to discuss minimum necessary.  Ideas for a tentative agenda followed. 

Linda Kloss asked a subcommittee member to share information about previous conversations concerning minimum necessary.  He reported that most people were not comfortable having a discussion on the topic because it is misunderstood.  OCR has seen many complaints and should provide guidance on this topic.  In the upcoming committee meeting, the panel is scheduled to begin with Dr. Rothstein giving an overview, followed by Bob Gellman with the legal perspective, Adam Greene, formerly of OCR, and Marilyn Zigmund Luke from AHIP.  Panel 2 added another speaker from Healthcare Compliance Associates, who will focus on practical implementation of minimum necessary. 

The panels that have been formed illustrate how poorly understood these concepts are in the industry.  A brief conversation followed with discussion around how to answer questions such as:  How have these concepts of minimum necessary evolved and how are they considered?  How are they being framed in the HIPAA law?  What should guidance look like?  Was the concept initially “collect only what you need,” “disclose only what you need,” or both?  Consensus was achieved for scheduling a 60-minute conference call in the upcoming week to discuss next steps on both hearings.

Adjournment  4:05 p.m.