[This Transcript is Unedited]
Department of Health and Human Services
National Committee on Vital and Health Statistics
Working Group on HHS Data Access and Use
February 25, 2015
National Center for Health Statistics
Auditorium
3311 Toledo Road
Hyattsville, MD 20782
Proceedings by:
CASET Associates, Ltd.
caset@caset.net
P R O C E E D I N G S (1:04 p.m.)
DR. MAYS: Good afternoon. We are going to start the meeting of the NCVHS Work Group on Data Access and Use. My name is Vickie Mays; I am at the University of California, Los Angeles. I am the Chair of the Work Group. Why don’t we go around the table.
DR. FRANCIS: I am Leslie Francis. I am at the University of Utah and I am a member of the Data Work Group, and as far as I know I don’t have any conflicts.
DR. COHEN: Bruce Cohen, Massachusetts Department of Public Health, member of the full Committee, Co-Chair of the Population Health Subcommittee, member of the Data Work Group, no conflicts.
DR. ROSENTHAL: Joshua Rosenthal, RowdMap, no conflicts, Working Group.
MR. SOONTHORNSIMA: Member of the Work Group, no conflicts.
DR. VAUGHAN: Leah Vaughan, member of the Work Group, no conflicts.
MS. SQUIRE: Marietta Squire, staff to the Work Group.
MS. KLOSS: Linda Kloss, member of the full Committee and Co-Chair of the Privacy, Confidentiality and Security Subcommittee.
MS. JACKSON: Debbie Jackson, National Center for Health Statistics, Acting Executive Secretary for the National Committee.
MR. PORTNOY: David Portnoy, entrepreneur in residence, HHS, Idea Lab.
MS. HORNSTEIN: Rachel Hornstein, ASPE data policy.
DR. SUAREZ: Good afternoon, everyone. I am Walter Suarez, member of the National Committee and member of the Data Access Work Group.
DR. MAYS: And online — Kenyon?
MR. CROWLEY: Good afternoon, this is Kenyon Crowley with University of Maryland, member of the Working Group, no conflicts.
DR. MAYS: Thank you everybody. I know that for some people travel in has been more than a notion, so I appreciate my colleagues who have probably had to come through ice and snow and cold and what have you, so thank you for being here.
Let me just do an overview of a couple things. One, I want to make sure everybody has the agenda. Two, I think that we have slides. I did a presentation earlier so I want to make sure that you have a copy of that because we are going to refer to it. For you it won’t be new. It’s the kinds of things we have been talking about on our conference calls.
Three, did we make copies of Kenyon’s slides? Thank you. They’re in the folder so you should have all the materials.
Normally we start with an update, but it worked out perfectly. Damon has to take a 1:00 o’clock call so he is out taking that call. Kenyon can only be with us up until 2:00, so we are going to start our process and then we will do the update a little bit later.
As many of you know, part of what we have been tasked with is to think about giving guidance to HHS data producers on behalf of data users. When we think about this issue of guidance, one of the things we have done is looked at the literature. We came up with an excellent article that Kenyon provided us in which we looked at some of the impediments, some of the barriers. Part of where we want to start is thinking about those challenges and coming up with solution areas.
Kenyon, because I want to use your time maximally, everybody here knows the back history so I’m going to let you get started.
Agenda Item: Data Communities and Best Practices
MR. CROWLEY: Thank you, Vickie. I apologize that I can’t be there in person, although I am about five miles down the road.
I am going to walk through this brief presentation. This is a follow-up to our earlier discussions on the development of a guidelines document. Various members of the Work Group individually or in teams were asked to look at specific areas, and one of the areas that I was asked to look at was around some of the challenges and impediments to having the effective sort of open data landscape that we envisioned through our charter and from the discussions. For the next 10 or 15 minutes I am going to walk through eight challenge areas that have corresponding solutions as a potential framework for how we think about structuring the guidelines and solutions work.
As a little more background, the way this is being framed is thinking about open data not just as a product or an output but really as a process, and that process starting from collection, how it’s published, how it’s described, what HHS and the community can do to make it findable, understandable, actionable, and just an understanding that that will be constantly evolving and will need to take in information from the market to be most effective.
I should also add that these contents I am providing are not solely mine; they are coming from many of our work group discussions we have had to date, especially some of the stuff that Josh and Bill have provided before, as well as some of the current literature on open data effectiveness. So process is the background.
The value from the data will be maximized. As I was saying, how can people find it, how can we make it understandable for them, and how can they take action with that. And then understand that when we talk about users there is a lot of diversity in the users, diversity both in how they engage with data, what their literacy level will be, what their technology sophistication may be. As we think about this it is probably not a one size fits all approach, but there needs to be adequate consideration to the different user types.
We spent a lot of time talking about data entrepreneurs and the public health community, some with consumers, but there may be others. So again, just as a framing, we need to think about what are appropriate ways for the different communities. It’s not necessarily that all data needs to be presented to all users in certain ways, as I think we would agree there are not infinite resources at HHS, but there should be thought given to how we can create that access.
If anyone has any questions or wants to chime in, please feel free. These next two slides are some of the key concepts that have been elucidated from the discussions and from the literature and I will walk quickly through these.
One is the issue of availability and access. When we talk about availability and access this means both physical access as well as intellectual access. Physical access may be the document’s location or the formats, the abilities required, whereas intellectual access includes how is that information categorized, organized, displayed and represented so that the users can understand it. That is at the core.
And thinking sort of in the future, we may even expand our definition of access to social access and understand that different communities can interpret data differently and may have certain biases or certain engagement practices that may moderate access efficacy.
The next key concept — and again, these concepts are based on some of the challenges — is findability. This is at the core of open data. Pathways to the data need to be clear, should be provided with simple, easy-to-use tools. We may think about providing insights to data seekers specifically by audience; i.e., whether they are a researcher, an entrepreneur, et cetera, and also by the types of questions they might be trying to answer and how that data can facilitate that use.
Indeed, there should be an ongoing assessment of how users are actually reaching the data or not reaching the data that they are seeking to reach so that the most efficient channels can be promoted, and for those channels that are not demonstrating efficacy in people finding them modifications can be implemented. I would also say as sort of a key concept around findability, too, is the metadata, say, with respect to datasets. We spend a lot of time talking about socially tagged metadata, about users, content, applications, and that is key.
The next key concept for the challenges we should be thinking about for our guidelines is usability. How can we increase the ease of use, the satisfaction with the data and the ability for folks to use the data in the forms that are most familiar and in which they are versed? That might be an open source analytical tool such as R for researchers. We have APIs for developers, and there may be data dashboards for consumers or for public health. Many folks use Excel as their primary tool today. To get to that usability, common, standards-based, consistent and easy data use practices should be implemented.
In the usability domain as well, as we look to the future, we may also think about what are the ways that we can actually catalog the research queries, sort of in the concept of open source algorithms, so that researchers are able to create value from the data, and using these algorithms should they choose (not be required to, but should they choose), they could share, reuse or customize those algorithms for individual community use.
And as a last point, terms of data use should be very clear, simple and consistent.
The fourth content area for challenges is related to usability but there are some nuances, and that is the comprehensibility of the data. Data description should also be provided in clear language along with corresponding data dictionaries, entity relationship diagrams, questions that may help answer their questions, how the data has been used including the links to any results. Commentary or publications associated with that data can be very helpful in supporting learning around that data. This is another area we have had a lot of discussion on, Josh in particular.
And comprehension — I think that, too, works for the greater use of visualization. For any audience it is showing to be an effective way to pick out insights from the data. So those are the first four. I can go to the next slide or, Vickie, would you like to wait until the end to have discussion?
DR. MAYS: No. I think what we will do is take questions, but I’m going to take the prerogative of the Chair in case we stay too long because I want to make sure that you can finish. But I think we should take questions, suggestions, et cetera, on this section. Walter?
DR. SUAREZ: Thank you. One concept that I didn’t see is usefulness. You talked about usability but it seems that usefulness is a different dimension in the concepts. I don’t know if you were looking at quality in the next slide in the next set or if you have any comments about that concept of usefulness or utility, that element separate from usability.
MR. CROWLEY: The point is well taken. Usefulness in some contexts is sort of lumped in with usability; for something to be useful it needs to be usable, understandable. There are a number of heuristics you can define it by. But I think for the actual usefulness, that does touch on some of the next concepts on the next couple of slides.
DR. MAYS: Walter, do you see usefulness as kind of hooking in with other discussions, and that in the literature — because that’s where we got many of these and it is wrapped into others — do you think that we should really pull it out?
DR. SUAREZ: As I see it, I think it is going to be one of the very critical and most valuable elements of data access and use. I don’t know that I agree necessarily with combining usability with usefulness. The usability, how easy it is to use — something might be very easy to use but its utility, its benefit for the purpose for which it was originally created or made available, might be very low. I would suggest perhaps considering it as a separate concept.
MR. CROWLEY: I think the points are good. When I think about useful in this context it means can users do with the data what they intend to do. Usefulness is also one of these cross-cutting concepts. For it to be useful, first they need to have access to it; they need to be able to find it, and then they need to interact with it in ways that fit their cognitive processes and the tools with which they are familiar.
I agree we can highlight that or find a way to make it more specific, but I think usefulness is cross-cutting across many of these dimensions.
DR. MAYS: We are going to get a few more questions here.
DR. ROSENTHAL: Hi Kenyon, this is Josh. This is excellent work, such a great summary. I’m kind of reeling from where we were two years ago to where we are today and that’s fantastic. I have just a couple little points.
I would love to see this kind of polished a little bit and then another layer of detail, literally, what are examples. Take an existing HHS set and walk through what we mean by each of these.
On usability, I think we should at least add or give a tip of the hat to browsers. We talk about APIs and the tech-heavy things, but data browsers are where literally someone with no coding and even very little analytic skills can have access to the data. We talked about different ways to put that out there. You can put it in an enclave, you can put it in an API, you can also put it out in a data browser like Google or Tableau or something like that where 16-year-old kids can do co-morbidity exploration.
On comprehensibility, I would agree that usually you do break out utility versus usefulness or meaning or applicability or call it whatever you want to. You can have an incredibly meaningless set that is very well —
DR. ROSENTHAL: Yes. We have a very simple set that says what color paint is there on the wall over every building here. I think one of the key tenets for the charge is how useful is the data, and honestly, that is cut by community. And what I mean by that is, you know, to Damon’s scorecard point, we talked about this in a variety of different ways. Are there certain sets that have more inherent value for different communities? As I go through the list of what are the HHS all-stars, are there certain sets that are incredibly valuable for population health, are there certain sets that are valuable — say HHS releases some news thing and basically says, hey, we want to move 50 percent of our spend to pay-for-value instead of fee-for-service over the next 3 or 5 years. Well, there are certain sets that are sitting in HHS right now that are incredibly applicable for that.
DR. MAYS: So what is the blurb that you want? You want it to say more of a definition of what is here, or do you want it to be more of an example of how to do it and what to do it with?
DR. ROSENTHAL: I think both. We have these definitions and that’s great, but we should also work through tactical examples. I guess I’m advocating for another category, which is usefulness, and that is outside of this. That gets to the core charge in my opinion. Does HHS want to spend resources with datasets that are not going to have that much meaning for various communities, or do we want to say, based on the context of ACA, based on the context of other things, there are a certain number of sets which have amazing value for people.
That kind of gets to that thing around mad libs or whatever, to be able to say, hey, I could use this dataset for blank; I could use the Behavioral Risk Factor Surveillance System to out-predict claims analysis for total cost. Hey, that’s crazy. It’s in the literature; it completely circumvents a bunch of risks. So basically taking existing sets and saying these sets have more use, or they have more meaning or they are of a higher value, however we are defining that: the value of the content of the data rather than just the structure and relationships therein. Does that make sense?
DR. SUAREZ: Yes.
DR. MAYS: I think it’s one of the things that in follow-up we should probably put on for a conference call to really discuss because there are pros and cons about that. It may be that something is useful to a very small number of people but for a very important problem. I would hate to have us put something out and what happens then is that people come and say, well, let’s get rid of these six datasets because they seem not as useful. But there has to be something else, and that’s what I think we need to hear from HHS.
DR. ROSENTHAL: Damon has talked about the prioritization of sets. David is literally going to present on what I’m talking about. He is literally going to say, hey, crazy idea for HHS; maybe we should rank the usefulness of these sets and basically have the communities basically say we think this set is more important than that set, because with finite resources where are you going to spend your money.
DR. SUAREZ: Just a quick follow-up. Your point about it might be useful for a limited amount of people or for a limited purpose, I think that applies to all of them. It might be accessible only to a limited amount of people. It might be available to only a limited amount of people. So that concept is not unique to usefulness. It is characteristic of all the concepts and it goes back to the prioritization process.
I have another term besides usefulness and utility – how meaningful it is.
DR. ROSENTHAL: In the literature and in these discussions you can fall off the other side of the donkey. You can basically start crossing things off and you definitely don’t want to be in that scenario. But if you look at a dataset and say who does this hold meaning for, then you can say for researchers we have five sets, for communities we have zero sets.
DR. MAYS: Yes. I like that a little better.
DR. ROSENTHAL: We are not serving this population. And the real trick of it is you can take an existing set and say this has meaning for different communities. So behavioral risk factor is such a fantastic example of this has meaning for population health, this has meaning for communities, and guess what. It also has meaning for payers and market, et cetera. I can literally use that set for an unintended use, and certain sets have meaning for more than one community.
DR. MAYS: Bruce, I want to ask you a question, and maybe Bill can answer it. Would this get into what your taxonomy would actually help to answer as well?
DR. COHEN: I don’t know whether we have the same opinion. Meaningfulness to me is more of a qualitative judgment than I think the taxonomy is intended to include. The taxonomy I see as trying to be more objectively descriptive about the data contents and structure. Meaningfulness is a really slippery slope, and there are a lot of other things for me that are more important.
We could see how often different datasets are used to get a sense of how much resources we might want to be putting into them. But I’ve been involved in situations where, because of budget cutbacks, we cut back a dataset and nobody cared for three years. Then, all of a sudden, where are the data that we had, because now we need them.
Maybe it’s the way I live my life, I never throw anything away, but I feel that way about data and it would take a lot to convince me that data that somebody started to collect for a particular reason is not potentially meaningful to us now or in the future for a particular population or purpose.
I guess a lot of these concepts — I think Walter brought up a good point about usefulness, but there might be objective measures, some more heuristic if we go to grading, but I’m really very wary about going down the meaningfulness slope.
DR. ROSENTHAL: One nuance — you don’t have to do it top-down; you can do it bottom-up, and you could actually pull a folksonomy into a taxonomy and basically say, hey, if you have users self-identified as belonging to community and you hit downloads by that, you don’t have to decree this dataset is good for community and this dataset is good for so-and-so. You can literally look at the usage, put bands on it and say this dataset has the most usage by these types of folks. Then you can basically build that in kind of bottom-up folksonomy. That’s why if you have that learning center it’s like a key thing to the input.
MR. CROWLEY: Some of the HHS sites have started to do that. I think a good example of that is the CMS Data Explorer. You know how many downloads it has for each dataset, and they have these tools built in.
DR. ROSENTHAL: That’s why having an example of this, say — let’s take that, have the users self-identify — I’m population health, I’m market, I’m whatever — and then say, oh, this has 5,000 downloads of pop health people, this has two downloads of community people, this has one download of business. You can start to put some very basic inputs on it and be able to build up. This is a fantastic dataset that somehow people in pop health are at least using it.
Then, if you are able to put in the mad libs, you can use this set for this or that, and connect it to a learning center, then you can really tie in the qualitative meaningfulness.
DR. MAYS: This is kind of another approach to social tagging. Leslie?
MR. CROWLEY: Right. And you may see that users you didn’t anticipate using it are the ones actually making the most use of it in ways you didn’t think of, and that helps frame what is needed for development or marketing.
DR. ROSENTHAL: All your fears are well founded top-down exclusively; there is another approach that can find a middle ground.
MR. CROWLEY: On that point, maybe something else that is not highlighted here — this is sort of our first draft of bringing together these concepts. But maybe more explicitly in terms of our challenge of what we need to do is sort of the analytics around the datasets that are going to be most instructive for HHS and the broader community.
DR. ROSENTHAL: Literally, let’s sketch out an example of it. I think that is probably a good next step in all of these pieces.
DR. MAYS: I think we are really developing what the next work is in terms of taking these concepts, expanding them and adding some examples, as well as a use case and potentially the analytics.
We are going to let Leslie get in, followed by Bill followed by Walter.
DR. FRANCIS: I wanted to make a stewardship observation which I think is that we shouldn’t lose sight of how it can be furthered because I think it actually can, in some quite lovely ways, be furthered by things like findability, usability and comprehensibility. It is openness and transparency. One of the really important features of data stewardship is making sure that people are aware of what data there are and how those data are being used. So something about when data are findable and usable, and it is actually much more transparent. I’m just suggesting that is a dimension to findability and usability. I would call it transparency for now. The goals just go nicely in tandem.
DR. MAYS: I think when we talk about data stewardship, that is going to be kind of the opening frame, and it may be that when you’re talking about, in that opening frame, openness and transparency and then how these are examples of that, I think that would be great.
DR. STEAD: I will try to answer your question about relationship to taxonomy. In our discussions of the framework, everything started with what is the purpose of the dataset, and what is the purpose of the primary dataset as captured for its primary purpose, and what is the purpose of the secondary use. Those are distinct. So usefulness would have to be tied to purpose, and primary use purpose and secondary use purpose would be distinct.
The other thing we got into that relates to usefulness was a discussion we had at the framework workshop around timeliness. It had to do with how long did it take for the data to become fit for the purpose, and what was its shelf life, how fast did it degrade. Those were aspects where you need to match the timeliness of the primary or secondary use purpose to that aspect of the dataset. So it gets complex.
DR. SUAREZ: I was actually going to raise that exact point. In our letter from March, the Data Access and Use letter to the Secretary in March, the three items we covered were usability, use and usefulness, and in usefulness we focused on and highlighted timeliness both in terms of how close to the event the data is and then how it degrades. So I think that concept of usefulness again comes back from previous statements that we have made.
DR. MAYS: Let’s take the last question on this and then we will let Josh have the comment and then we will go back to Kenyon.
DR. ROSENTHAL: This may seem a little silly but I find it pretty helpful with these framework exercises. Whenever we do this, either for internal or for external use, can we translate the nomenclature to something simple that everyone can understand? As a repentant academic, I’d like to say availability and access is can I get a hold of that. Findability is can I find what I need, can I discover the unanticipated. Usable, can I work with it. Comprehensible, can I understand it or its output. Quality, can I rely on it. Linking and combining data, can I use it in conjunction with something else? Support, am I alone in the unit. You get the basic idea. I’m making it really specific.
DR. MAYS: That then makes it easier when we do the use case for the consumer.
DR. TANG: I wonder if someone could send me the slides that you presented to the Committee, as you said you were going to refer to them.
DR. MAYS: They are doing it now.
This is very good. I think we have a sense of what we need to add and tweak, and I think this is getting us further along. Walter, thanks, because we should probably refer back to that letter and I had totally forgotten about it.
Okay, Kenyon.
MR. CROWLEY: Thanks. Great comments.
For the next four concept-related challenges for the data, one being the quality of the data — that is thinking about assessments for the completeness and cleanliness of the data. Actually, for each of these areas we could write probably a 20-page position paper on them, so this is just sort of a high-level framing and next steps I’ll talk about in the closing, but we can sort of pick and choose how deep and wide to go. But quality is one because, as we know, there are varying levels of completeness and cleanliness across HHS datasets which affects end users.
The next concept is on linking and combining data. To the extent that we can facilitate the ease of putting data together, that increases its desirability in many ways. There are so many datasets right now that are on similar topics. There are some that are not as similar but have constructs that may be merged to find additional meaning. To the extent that we can facilitate that in ways that are also sensitive to privacy concerns, that should be an area that we delve into.
Some of the challenges are in the linking of data. There is a lack of standards for some types of data that could be linked, but there are ways to merge data together. Indeed, as we are doing this, we should be thinking about how we can perhaps catalog the suitability and experience of linking and combining different datasets, to expose that and make it transparent to the communities we are serving who would need to do that linking of data.
The next concept, and this is one that on our last call there seemed to be a lot of vigorous discussion about, is data provider support. In the literature on communities of practice and online communities, which have been around for a long time now, one of the key ways that value is created within these communities is having dedicated support from the data provider, whether that’s a community data specialist or manager. It is their responsibility to be a part of the data community and to provide subject matter expertise, customer service, monitoring and moderation, evaluation such as the metrics and reporting that are important for different datasets, and then recommendations for implementation improvements to the dataset.
Just personally, in using past datasets, it has been very instructive to have a point of contact to answer questions or get feedback from HHS. I know that for certain agencies now there are help desks, but to be more explicitly providing data expertise to the community for each dataset and facilitating a community as a sort of trusted voice in that community I think could have big benefits to the goals that are sought in the charter.
The last concept on this draft is the community-building and learning piece, which is another area we have talked about a lot. I think we agree that a vibrant community around data creates outcomes that we hope for in terms of learning, reuse of the techniques, being able to apply the data into applications and other programs. This community-building and learning piece is an area of big opportunity for many of the HHS data people.
We have talked about many of the constraints given being a public body, but whether the solution may involve some sort of public-private partnership or other ways of fostering community in terms of amplifying its value, I think that is an important area we could probably come up with some recommendations for.
Again, this is an area where models and communities of practice have been around for a long time and have best practices, and there are likely ways we could help apply some of those best practices to HHS data to facilitate a learning health system and community.
In closing, open data policy and resources can continue to be assessed using these concepts and maybe additional concepts, whether that’s use, usefulness or others, but these are ones that have been identified as particularly important in opening data for greater value. Some of these components may be most suitable for NCVHS to adjudicate and supply recommendations.
With that, I will pause.
DR. MAYS: Let’s start with Damon, and then we’ll work the table.
MR. DAVIS: I have a couple of questions. First, thanks for pointing out many of these concepts. It is something I have been thinking about with many of my colleagues across the department. I’m going to skip over quality for now — not that it’s not important, but I want to definitely get to the other three.
In the first one, linking and combining data, you mentioned that it would be valuable to expose the experience of linking the data to the community. Did I catch you right when you said that?
MR. CROWLEY: Yes. What I mean by that is some datasets are set up so that they are in a current format that is more easily integrated with tools such as R or data browsers or have existing cases documented as them being effectively merged with other datasets. So, in essence, having a way to understand which datasets are most able to be linked and combined, or have been most successful being combined with other datasets.
DR. MAYS: I think that was a comment earlier when I was presenting about letting people know which dataset in which you don’t have to create a variable in order to be able to link it, but the data is right there or the tool is right there so that the person could actually do what they need to do, link the two things and have an answer then, as opposed to having to download something, go get another tool, or that the data is in a form where it takes a lot of cleaning and massaging before you can do it.
MR. DAVIS: That is helpful, because that has been one of my interests and concerns, is how much should the Department expend its own resources to try to create the data linkages that someone thinks will be meaningful versus sort of pointing the way down the path to say you could do the data linkage with these two datasets based on the availability of these two tools and the utilization of this field or that field, whatever.
DR. ROSENTHAL: I think you will get 90 percent of the bang for your buck, because this is what I do every day, just exposing the metadata. Part of that is if I can just see the metadata I can see this set has HR and that set has HR — that’s all I need. The problem is I can’t do that unless I link and sift through it, literally. Just exposing the metadata will allow 80 percent of your average users to be able to do that on their own.
MR. DAVIS: I see the value in that. You are right. Having them actually get the data and download it and then see the possibility — that’s a long, extended resource life cycle.
I have just a couple more quick things. On the data provider support, one of the things that I have been concerned about with my colleagues across the Department is that as we put out more and more data we have more and more people that are interested in it, and while building a community will be fantastic, I am also a little concerned with quite literally just the burden of having to potentially answer a bunch of questions. Some of them will be awesome questions and some of them will be one-offs where someone basically could have read the documentation.
I guess what I’m saying is that while thinking about the community-building component, I wonder if we will burden the folks in the Department to be responsive to the community. Or how much can the community be supporting itself, you know, the knowledgeable people within the community answering questions across the domain versus necessarily relying on the SME (subject matter expert) within the Department to provide the answers.
So I would love recommendations from you guys along those lines towards the community-building piece because that is, in fact, one thing I have been asked to think strategically about. If we didn’t liberate another dataset, we could probably get a tremendous amount of value from the opportunity to create a larger, more robust user community for the data we already have. And we could get a really strong signal for what is of value to them if we were to have this community stand up and be vital.
I would want to make sure that I get some recommendations from the group as to what those practices, Kenyon, that you mentioned are for community building because I think that would be very valuable for us to take and socialize across the Department.
MR. CROWLEY: I think you are absolutely right. To be effective, it needs to be a community-to-community relationship in terms of the community sort of helping itself. I think good examples of that are product forums. The Google product forum does a good job of that, where it’s the community that is answering many of the questions but at times you have the Google expert from Google say that is right, that is not right, or sort of point them in the right direction.
And what is the right framework for that I think will take some serious thought.
MR. DAVIS: A medium amount of moderation is what I’m hearing from you.
DR. ROSENTHAL: This is like Dell product service back in the 1990s where you have Dell users outperforming the technical Dell reps. You need three things in what you’re asking for: the ability to identify yourself as part of a community, whichever you pick from a dropdown menu; a level of self-identification of expertise; and the ability for other users to rank your answers on how helpful they found them, with that ranking computationally imputed, just your total sum, to you. Just like Amazon. You do those three things and you will wipe out 90 percent of your forum moderation burden.
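[Editorial sketch, not part of the proceedings. The three mechanisms Dr. Rosenthal lists (self-identified community, self-identified expertise, and peer ratings summed back to the answerer) could be modeled roughly as follows. The class name, field names, user names, and sample ratings are all hypothetical, chosen only for illustration.]

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Answer:
    author: str
    community: str    # self-identified, e.g. picked from a dropdown
    expertise: str    # self-identified, e.g. "expert" or "neophyte"
    stars: list[int]  # ratings left by other users (0 to 5)


def reputation(answers: list[Answer]) -> dict[str, int]:
    """Sum every star each author has received across all answers."""
    totals: dict[str, int] = defaultdict(int)
    for answer in answers:
        totals[answer.author] += sum(answer.stars)
    return dict(totals)


# Hypothetical sample data.
answers = [
    Answer("josh27", "business", "expert", [5, 4, 5]),
    Answer("josh27", "business", "expert", [3]),
    Answer("newuser", "community", "neophyte", [0, 1]),
]

scores = reputation(answers)
# Authors ordered from highest to lowest total, as a moderator might view them.
ranked = sorted(scores, key=scores.get, reverse=True)
```

[A plain running total is the simplest reading of "just your total sum"; an average per answer, which is mentioned later in the discussion, would be an equally reasonable choice.]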
DR. MAYS: When we were on a call last time, we were actually struggling with this and talking about ways in which the users could actually find when things had been previously answered — add to being the person that answers the question — that it’s not always the data provider that has to do it first. The data provider actually should be moderating and not necessarily coming to the aid for everything but actually — you will begin to see that if we look at Josh 27, oh, he answers these questions very well. You begin to get that, and I think that’s what we were talking about as the operative for this. Otherwise, it’s not going to happen.
DR. ROSENTHAL: A little bit of flagging. So if somebody says someone totally screwed this up, I can flag that, and then the flagging queue should be the moderator’s primary work.
DR. MAYS: I think we are also on the same page with that. Bill, then Leslie.
DR. STEAD: An example from a very different space is the search and filtering service in our biomedical library. They maintain a frequently-asked question base that has the questions that have been asked, the answers, and then the search strategies that were used to achieve those answers. So that’s an example of a packaging that makes it easy for people to both get the answers and have the start of what in this case would be the analytic process.
DR. MAYS: That would be something we would recommend to whoever is the data provider to do, and to have those analytics in the background to be able to be used to evaluate it.
DR. FRANCIS: Probably the most difficult of these from a stewardship perspective is linking and combining data. At least part of why is it’s technically very hard, but at least part of why points to a dimension of all of these that we need to attend to, and that is that none are static. With respect to data linkage and combining, what’s really an additional dimension of how hard it is, that linkage isn’t just the first person who downloads the dataset creating a linked set; it is that linked set then being used downstream by someone else.
It is also that new data from other sources are forever coming available, so a linkage that might have been entirely innocuous from the perspective of being able to identify individuals, when a different dimension of data becomes available from another source, might become problematic.
Now, how to get your hands around that I don’t have an answer to, but I wanted to use this to point out that none of these are static. Not even something like comprehensibility because the sorts of tools for understanding might change, or the cultural vernacular might change, or whatever.
I think one of the very hardest is the linking and combining data, and because of the potential risks there — I guess I’ll put it this way. You don’t want to have a risk of linking that’s purely theoretical to chill data availability. On the other hand, you don’t want a risk of inappropriate data linking to happen, which would thereby also chill data availability. So, some way of trying to understand what is going on downstream.
DR. MAYS: Yes, it’s one of the things that we probably somewhere along the line need to spend a bit of time on. What we can’t control are all the mash-ups. What we can’t control is that this is a time when people are putting data together that we would have never thought. As I said, I went to the police brutality one and the things they were putting together I would have never thought to put those together. It’s so out of our hands that I think what we have to do is to figure out what the warning or limitation or something is.
Josh, and then we’ll cycle back around.
DR. FRANCIS: Could I just add to that? Just even knowing what mash-ups are happening may be as important as being able to control.
DR. MAYS: I don’t know if there’s a mash-up list. I just have a sense that you hit, you run, you do it and then there’s something different the next day.
DR. ROSENTHAL: There are three tactical things that I think would be really helpful. One is basic ideas of usage. If you’re going to say HIPAA is this, blah, blah, blah, and technically there are clauses where it falls under if you’re doing learning imputed to population even if it was HIPAA protected. Just basic stuff like that saying, hey, if it’s in public use, unrestricted, it falls under this; it doesn’t fall under that. Just some basic guidelines that 99 percent of people aren’t aware of. That would be very nifty in a learning center.
Like just because you are using de-identified data, if an input was a HIPAA-related set and you’re using it for like an abstraction, by the way, you still fall under HIPAA, right? I guarantee nine out of 10 people have no idea about that.
The second thing was it would be good to put this up with Damon’s scorecard and say where does the HHS internal scorecard stack up against this. Is it usable? How are you looking at it? So we can be in tune with that.
Finally, also in terms of a work flow, where can we be most helpful? You tell us. Is it five things to do with the community? Is it release notes or release packages that follow this template? Literally making sure that we sync up.
This is all in the shadow of, you know, post the Barkley comments, thinking this is still worthwhile. Charles Barkley said analytics is crap and the whole world blew up around that. I still think we should do the stuff.
DR. MAYS: Kenyon, this was great. I think it really pushed us along. It gave us next steps on this which I think really is about trying to generate first some easy, what are these things, how can I use it, et cetera, in terms of the terminology.
We have expanded a few that we want to include. We need to think about technologically how do we do this, to explain that, and then to say a little bit about what it is. I think at the end of this we’ll try to figure out what our next things are.
Anything else before you have to run off?
MR. CROWLEY: Not today. Thanks for a great conversation. I look forward to participating after the call-in. Just give me my marching orders and I’m ready to go.
DR. COHEN: I am going to use your key concepts framework to provide my community perspectives. I think this works really well. I’m sorry you are going to miss it.
DR. MAYS: Thanks, Kenyon. We have had new members join us. Hi, Paul.
DR. TANG: A quick comment, Vickie. When Josh mentioned these quick tips, he blew by his example of combining a HIPAA dataset with your other de-identified data and it raised my awareness of how simply people could get into trouble. I just wanted to emphasize that point he made. There are a few things everybody should know today but really aren’t aware of, if you are new to data. I just wanted to reinforce what he was saying. That example just struck me — of course, that could happen.
DR. MAYS: Okay, we will have the “this is how to keep yourself out of trouble” tips.
MR. SAVAGE: Sorry to be late. Mark Savage with the National Partnership for Women and Families.
DR. MAYS: Okay. Hopefully, what we are going to do is to be able, particularly our subcommittees, to think about these things we’re talking about and be able to see how they interleave with the things you are doing in terms of the taxonomy, your community data needs, et cetera.
The next thing we want to talk about is data communities. Kenyon started talking about that, and, Damon, you were also talking about this whole notion of wanting to build these. Here is where I think we ought to move to.
Bruce, you gave us material. It was great, so we want to get you back into the discussion with us. We kept saying who is the community, which is the community we are talking about, so I want to start us here. Let’s start this way. This is leading the discussion. Everybody gets to participate but it’s called leading the discussion.
Bruce, if you could talk about what community you are focusing on. Josh is also going to talk about it. And then I would like Damon to also share with the group what you were talking about previously about wanting to think about building these communities so that you can get better feedback and develop some datasets better than others.
DR. COHEN: What I thought I would do to sort of facilitate and provoke discussion, from my experience working with communities, is give you my community perspective on these key concepts.
My perspective on these concepts I guess is informed by participating in community needs assessments in several different communities, developing a web query system at the community level for all 351 communities in Massachusetts as well as neighborhoods in larger communities that include about 37 datasets covering chronic diseases, health, graduation, Department of Education, transportation, because all these are part of the community needs when they think about health.
My third bona fide expertise is I spend a lot of time going around the state to talk to community coalitions about how to use data in their activities. You are welcome to agree or disagree with my points of view around these concepts, but let me start going through the concepts.
From the community perspective, availability and access — what communities want are aggregate data and simple measures. It is very clear. When we begin talking about individual data linkage, it’s really irrelevant to communities. They want summary, useful measures. They might want them for populations that we cannot generate the data for; they might want them for middle-aged Puerto Rican women, but they want simple measures.
Findability — the data is overwhelming. It’s amazing how little communities know unless they have data mavens that are part of their activity. Communities can round up the usual suspects. There are like 12 datasets that communities use, maybe even less. There’s the vital statistics, the cancer registry, some health survey, the BRFSS. I could list them on the fingers of one hand probably. Everything else is brought in by experts who might be local experts. You might have somebody from public safety who happens to have a child abuse dataset for your local community. But in terms of ubiquitous findability, it’s really difficult, and as data become more open it becomes more difficult because there are more choices and communities need guidance about how to know what works for them.
Usability: communities want data, both qualitative and quantitative, that they can transform into decisions to move them forward. That is the definition of usability: whether they can transform an age-adjusted mortality rate per 100,000 persons into a decision. Say we have had three suicides and we have an age-adjusted mortality rate for coronary heart disease of 351 per 100,000. How does a community use those numbers to help decide whether they should intervene on chronic diseases or focus on youth suicide? Usability is really a function of translating abstract quantitative numbers into something that can help them make sense of the data and discern priorities.
I have been in situations where I create the list of 45 leading indicators and the community says which one should I choose. That rate is the highest; that is what we should be working on. Data don’t make decisions; data are a helpful input. The more usable we make the data to be able to transform into information, essentially to link it to solutions that we know work, the more valuable the data are and the more successful we are as data suppliers.
This is really something that I think data suppliers, fed, state, local, all data suppliers really need to focus on.
DR. MAYS: Let me just ask the question of how do I get — see, you are here channeling right now, and we heard who you are channeling. But in terms of a data community, how do I get them to be in a place where Damon is going to hear from them or somebody else is going to hear from them so that we can make those changes? Otherwise, you are a channeler. But how do I do it in real time?
DR. COHEN: That goes down to data provider support and technical resource support. I really think that to make data useful in community decision-making we need to train communities and provide them with the tools and resources to effectively own and use the data. We talked about data concierge models; we talked about using the HHS regional centers, using the CDC EIS (Epidemic Intelligence Service) model in this country to help data users on the ground.
DR. MAYS: Could you just explain — it was in an earlier meeting where we talked about the data concierge, and I don’t know if they know about the Ag data thing. Can you just say what it is?
DR. COHEN: These are all models. What the federal government — which is one of the largest health data providers — does really well, or has the capacity to do really well is provide resources to understand the information. This is a boots on the ground strategy. It takes a lot of work, but it doesn’t take an overwhelming level of competence to help use data. We can train people on using data and get them out into the community. It’s really linking communities that don’t have the resources.
Sometimes communities are lucky enough to have data geeks who are involved in the coalition, to have county reps who are interested in this particular neighborhood, to have local universities that want to support the improvement of quality of life in their communities. There are a variety of places where we can find expertise, but this expertise needs to be embedded in communities.
The notion of having web services or FAQs or metadata, that is nice and is a nice screening tool, but to really make these data active, I really believe in some way to get boots on the ground.
Comprehensible — there’s a lot of action going on in this area — use of infographics, dashboards, simple statistical techniques. Kenyon talked about spreadsheets. There are simple tools that can be mastered with little training. Here is where online training probably can be provided to spread this expanding basic technology for everyone to use and to master. If people can learn how to play games on the internet, they can learn how to do these things.
Charlie was here earlier and we talked a little bit about NCHS’ role in providing more data collection support. There are barefoot surveys that the feds can help train communities to do. Communities love to do qualitative data collection — key informant interviews,
DR. MAYS: What does that term mean? I have never heard that term. Barefoot surveys, I have never heard that.
DR. COHEN: It’s barefoot epidemiology —
DR. MAYS: Oh, as opposed to the shoe leather epidemiologist.
DR. COHEN: It’s essentially hiring graduate students, training graduate students to go out and collect data or implement surveys, or using community workers or community members to collect data. Low cost, low tech ways to generate community information that might not be available from the feds or traditional data providers.
But the feds’ role is training people how to do this in some kind of consistent way. BRFSS started to develop training. Since BRFSS was focused on the state and did some county estimates, many in my community said, I want to do the BRFSS in my community. I want you to either over-sample or teach me how to do this survey in my community because I want data to compare to the community next to me or to the state. So expanding the use of simple technologies to generate their own data is, I think, data liberation 2.0 or 3.1.
Quality — communities are not really concerned about quality. They accept the data they are given because they believe the data collectors have already examined and proven that it is quality data. This might be a shortcoming and a misgiving by community levels, but communities are so thankful to have any data they don’t really focus generally on the quality of the data that are collected.
Linking and combining data — again, communities don’t care about individual data linkages. They want to know how many pregnant teenagers or substance abusers. They don’t want to link the substance abuse intake files to the maternal and childhood and WIC program files. Linkage and combining data is less of an issue for communities that want to focus on answers.
The linkage that communities want is sort of a mental map linkage. They think, geez, 65 percent of the teen mothers in these three neighborhoods were substance abusers. Then they go from that number in their mind and say, I wonder whether Mrs. Jones’ daughter was one of them. They do reverse engineering to try to identify people.
But in community groups that are charged with making decisions about priorities for interventions, that is more the way the data linkage issue flows. I very rarely find communities that are concerned about actually identifying individuals by name other than what they already know about them. You really need to be cautious when you’re dealing with neighborhoods or small areas about keeping those conversations out of the broader discussions because they will happen.
Again, the focus is on aggregate data to help them think through problems rather than the actual data linkage. So, when we talk about data linkage, the solution at the community level is really community data linkage or neighborhood data linkage rather than individual data linkage. To me, that helps somewhat. It doesn’t necessarily solve privacy issues, but it can go a long way in abating risk.
I already mentioned data provider support. It would be revolutionary if we got these boots on the ground. The good news is communities are becoming much more sophisticated in their ability to use data. With basic minimum guidance, I think communities can take off in determining the usefulness and usability in their own mechanisms for using data to make decisions.
Community building and learning: it is a revelation to be in some of these community meetings and see people who were data-phobic, who didn’t know what a percentage was, go on to use quantitative information to help them address problems.
Also, data can be a wonderful tool to bring people to the table. In some of my community coalition work we would fund a rape crisis center, and as we began talking about not only rape crisis but falls among the elderly, people stayed engaged and were willing to open up to consider, gosh, maybe this is an issue, too, for a variety of reasons. Then that might lead to a generalized solution around safer transportation that affects both rape crisis and elderly falls. So, using data to create community building can be very powerful and it works well.
From my jaded community perspective, that is how I would use these concepts.
DR. ROSENTHAL: Before we jump all over Bruce, can I just give a thought. Let’s put both of these out there because it’s going to be an interesting compare and contrast and I will keep it real brief.
I love this framework that Kenyon did. I think we should at least seriously explore utility versus usefulness, because that is a standard comp sci division, and you are going to hear from HHS about wanting to do that regardless, and there are ways to do bottom-up that protect hacking and slashing, et cetera.
On availability and access, yes, yes, yes. Also the indices are great. I am speaking from the entrepreneur market value, blah, blah, blah, reasonably experienced in data, not a neophyte and wanting to do something public use or public good through market forces modeled on ACA type stuff. So access and availability, yes, indices are great. When HHS puts out not just what is the one to 30-day post-discharge rate for blah, blah, blah, the rates for hospitals, AMI or something like that, they actually put it out on an index as of last year. You know, this hospital is over-indexing and this hospital is under-indexing. That is really nifty.
On one hand, you are disintermediating and tearing apart a bunch of businesses that make their money doing that. On the other hand, it’s fantastic because you are already doing it internally. Do you know what I’m talking about?
DR. MAYS: I do, but what I am trying to get at is how it creates the community. I want to make sure I understand that.
DR. ROSENTHAL: You literally can’t have an exchange around is a hospital performing well or poorly if you have to do the case mix adjusting and run the model. If CMS actually says this hospital is over-indexing, meaning they’re doing more stuff than they should, now we can actually have an exchange around that. I need some normalization of index basically rather than just raw data.
DR. MAYS: I get that, but what I am trying to get us to do is to say how do we build this data community around these issues. I think what Bruce was doing was taking some of the frames and handling. The issue is how do we build these data communities. On the phone you had a ton of suggestions.
DR. ROSENTHAL: Most of it falls down to learning center. That’s the easiest way to do it, a meaningful learning center which has like bottom up. There’s a series of five or seven very specific things you want to be doing that are very simple to do. User self-identification, ranking of answers — users like ranking of answers imputed, meaning that you add it up. Do I have an A or do I have an F? Do I have 100 votes or do I have 10 votes, basically, and flagging.
DR. COHEN: Can you give me an example?
DR. ROSENTHAL: Sure. We could literally walk through it and it’s probably good to show visual examples of this. There are a couple of different ways.
One, I have a dataset, and rather than just tracking my usage I literally have a shell around it, and maybe it’s hosted at HHS or maybe it’s somebody else in private, public, or something like that, and it literally has — here is a dataset. Here is behavioral risk factor surveillance system. I can literally ask a question related to that set, and the kicker is if I ask a question, before I ask a question it asks me would I like to — through a little pop-up — would I like to identify myself as someone who is primarily interested in communities, in business, in monkeys, lemurs, whatever the categorization. And it also asks me a second thing. The second thing it asks me is would I care to identify my perceived knowledge, basically. Am I an expert, am I a neophyte, am I blah, blah, blah.
When I ask a question or when I leave an answer for somebody else, it has my name and it has my little colored blurb, and it also has a series of stars after it, or however you want to do it, and it basically says every time I leave a question you can come along, and now you can literally say you know what? Josh has no idea what he’s talking about; this is a horrible answer. I give him zero stars.
So, if you have those two things that allow you to basically say which datasets are being downloaded, which questions get the most activity from all the people, people in cohorts, and then how effective both by function and by level of expertise or community, and how effective are these people at answering these things, and if you want to get really sophisticated you can allow them to enter a tag, a free-flowing tag, then you’ll find all sorts of uses for your data that you haven’t anticipated.
Specifically, I will look at a set like behavioral risk factor surveillance and I would enter a tag, upcoding (over-coding and under-coding), because I can use that against risk to see which payers and which communities are over-coding or under-coding. So you start to find uses around that.
Those tags might actually be the metadata tags, if you want to get really slick, and by really slick I mean something that a 15-year-old can plug in and set up in 15 minutes. You could have suggested tags. So that’s literally like show me my metadata elements. You don’t need to spend all your time linking the data. Literally, I want to see the metadata elements. I want to see HRR, and then I want to type into a search box show me every dataset that has HRR as an element. All of a sudden I know which sets I’m going to be able to link through that column.
So that sort of tagging has a couple of self-identifications and it allows users to rate other users, and it shows me that Josh has an average of zero stars per answer so I should not listen to him.
From the analytic side, like usually the output, there are all sorts of services that are free and they can email it to you or you can do all sorts of things and it says, hey, if you classify your sets — and you can either do this top-down or you can allow the users to do it — you know, these are the top ten sets for overall use. These are the top ten sets by community members. These are the top ten sets that community members have found helpful, et cetera, et cetera. That’s the easiest way to do it.
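[Editorial sketch, not part of the proceedings. Dr. Rosenthal's request to "show me every dataset that has HRR as an element" amounts to a lookup over exposed metadata, without downloading or linking any data. The catalog below is invented for illustration; the dataset names and column names are assumptions and do not describe any real HHS catalog or API.]

```python
# Hypothetical catalog: dataset name -> set of metadata elements (column names).
catalog = {
    "hospital_readmissions": {"provider_id", "hrr", "readmission_rate"},
    "medicare_spending": {"hrr", "year", "spending_per_capita"},
    "county_health_survey": {"county_fips", "year", "smoking_prevalence"},
}


def datasets_with(element: str, catalog: dict[str, set[str]]) -> list[str]:
    """Return, in alphabetical order, every dataset exposing the element."""
    return sorted(name for name, elements in catalog.items() if element in elements)


def shared_elements(a: str, b: str, catalog: dict[str, set[str]]) -> set[str]:
    """Return the elements two datasets have in common (candidate link keys)."""
    return catalog[a] & catalog[b]


# Which sets could be linked through an HRR column?
linkable_on_hrr = datasets_with("hrr", catalog)

# What, if anything, could join two particular sets?
common = shared_elements("medicare_spending", "county_health_survey", catalog)
```

[A shared column name only suggests that a link is possible. Whether the values actually match in coding, granularity, and time period would still have to be checked against each dataset's documentation.]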
DR. MAYS: When you ask one of these websites to answer a question and you go to the website, as soon as you are finished it will ask you some questions. And then you come back and you find out was this useful, was this helpful.
DR. ROSENTHAL: If you think of Amazon, you do the star thing with product. I don’t get to rate Damon’s review. Actually now you do. They say is this helpful, yes or no. We just don’t expose that to people now. Literally, when you go into Amazon it says show me the reviews that are most helpful —
DR. MAYS: That is a really interesting way to think about, for example, different groups rating and whether or not this dataset is helpful to them. Someone was talking about then you link back and have this data manager or data whoever come back and say, you know, the people who are consumers or the people who are providers keep saying it’s not very useful. The researchers love us. Then you know in terms of your next set of efforts maybe you should really figure out can I do something different now to the way I have displayed this data to make it more useful to those groups.
DR. ROSENTHAL: Those are literally the three things you need to be able to do. What is your perceived usage level, what is your perceived community identification, and then a free tagging. If you do those three things you’re going to have a boatload of stuff come out of that. If you want to get really sophisticated, you take the free tags and you take the composite of the tags for user and usage and you apply them into your taxonomy. You literally extend a node basically and call it folksonomy and say usage folksonomy, community folksonomy and unstructured tagging folksonomy. All of a sudden, over-coding and under-coding pop up for purpose.
Then you might want to say, oh, that’s really interesting. I should actually move that, if you want to get really slick, from folksonomy into my core taxonomy and think about that. But that’s like hard core management. At the very least you want to be capturing that and using that.
And those sorts of tools are not cost prohibitive; they are used all over the place and will literally alleviate 90 percent of the work and effort. If you had basic flagging and had a moderator work the workflow you would be golden. Then you will figure out which sets — oh, here’s a set that not only the researchers like but also this other community likes. That’s really interesting. I don’t know why.
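[Editorial sketch, not part of the proceedings. The folksonomy idea described above (capture free tags alongside the user's self-identified community, then consider promoting frequently used tags into the core taxonomy) could be prototyped as a simple count. The taxonomy terms, tag events, and the promotion threshold are all hypothetical.]

```python
from collections import Counter

# Hypothetical curated taxonomy maintained by the data provider.
core_taxonomy = {"mortality", "readmissions"}

# Hypothetical free-tag events: (self-identified community, tag entered).
tag_events = [
    ("researcher", "mortality"),
    ("business", "over-coding"),
    ("business", "over-coding"),
    ("community", "over-coding"),
    ("community", "teen-pregnancy"),
]


def promotion_candidates(events, taxonomy, min_uses=3):
    """List free tags outside the taxonomy used at least min_uses times."""
    counts = Counter(tag for _, tag in events if tag not in taxonomy)
    return [tag for tag, uses in counts.most_common() if uses >= min_uses]


def tags_by_community(events):
    """Count tag usage separately for each self-identified community."""
    per_community = {}
    for community, tag in events:
        per_community.setdefault(community, Counter())[tag] += 1
    return per_community


candidates = promotion_candidates(tag_events, core_taxonomy)
usage = tags_by_community(tag_events)
```

[The threshold of three uses is arbitrary. In practice a moderator would review candidates before moving anything from the folksonomy into the core taxonomy.]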
That kind of gets to this release package we keep talking about if I’m a data producer and I’m going to release something and there are literally mad libs of being able to say there are two pieces around that. One, do I know if this is going to be repeatable and if I’m going to invest resources in the future. So part B and part D, they release the thing and people use it but not as many people use it as you would otherwise expect. And part of it is, in industry you’re like I don’t know if HHS is going to release this again; why would I invest resources. Last week they came out and said yes, we are, and that’s fantastic. But you know what would be helpful for driving adoption of that? Either saying we don’t know if we’re going to do that, or we think we are going to do that, or maybe it’s just a priority, this is a data all-star. Some sort of visibility into a perceived release schedule so you’re not just surprising everybody.
And along those lines, literally the mad libs: if it’s a data owner that’s great. Or you could literally have a data owner scanning the community into this and saying I think I could use this data for this, like a little mad lib around that. Everybody knows you can use BRFSS for certain things, but here are some additional things you might not be aware of that are coming up.
If you put that thing in a learning framework, you’re going to see the British Medical Journal article on BRFSS out-performing the rest in like 10 seconds. Then you say, oh, that’s really interesting. If I don’t want to process claims and do a risk model, maybe I can use BRFSS. Now, the granularity might not be right, but I have these communities that are using a greater degree of granularity if I’m focusing on Massachusetts or Connecticut or something.
DR. MAYS: It’s really about how are we building people. This kind of gives you a sense of that. Damon, Mark and then Leslie, and then Paul.
MR. DAVIS: One of the things that I want a little clarification on is how we have been talking about community today, because I think I heard it a couple of different ways. One was the sort of very localized people in a physical community who are looking at the data about the people around them that they serve in a specific way. And there is the other community which is the press, the developers and entrepreneurs and innovators, sort of more groups of user types which you could also define as community.
But by the pat on the back and the welcome, I suspect I am not going to get too much clarity on what we would like to focus on in terms of community because I think whatever we do focus on is going to be incredibly informative for what I’m trying to accomplish in the Department. And I feel like, from a perspective of driving healthdata.gov to be a better platform, it’s more likely that I will be focusing on community from the perspective of specific disease states and specific user types, the entrepreneurs and so on.
So I am focusing very much on the conversation from the community as it applies to user types. That could go all over the place, to be quite honest with you. You could have a very long dropdown list of user types regardless, but I would love for us to pick one and focus on that. I think that, ultimately, the localized communities are going to benefit from the user communities that I am most likely to be supporting, so I would love to focus on the user community development.
DR. ROSENTHAL: And you could do some tricks for conflating those so it’s not a big dropdown. I was talking about expertise versus — we picked three. It was communities, communities proper like developers, and then also business and entrepreneurs. If you wanted to not have self-identification of user capability you could just do it by the star count others are offering to you.
Beyond that, though, you started out very early on saying we have to be mindful of that. So if I’m going to say how do we build a community, one way is to lower the bar to interact with the information, not the data. I want that 16-year old girl who literally won the ReadWrite web thing out of 500,000 applicants looking at co-morbidities who can’t code a lick because we put it in a browser. She doesn’t have to look at the data; she’s not even using Excel. If you’re asking how to build communities like that, you can compress the list but you definitely want to be mindful of capabilities.
I would argue that what we have done is great, but liberating data is not so interesting to me. Liberating information for people who don’t know how to work with the data is like the next 2.0, and there are very specific things we can be doing around that.
MR. SAVAGE: I think I have a question parallel to what Damon was asking. I appreciate hearing communities this and communities that, but I suspect there is a range around the average. We heard an example of the community that accepts whatever data they can get without respect to quality because they are just appreciative of getting it. We also heard that communities are growing increasingly sophisticated.
In my experience, there are community-based organizations that are doing some very sophisticated healthcare delivery and a combination of datasets to do that because that’s the only way they can get what they need, in large metropolitan areas. I suspect it’s quite different in some rural areas. I am wondering if there is a useful introduction just around what is the range of communities that we are talking about that we are building for to inform our work. I assume it is not one size fits all.
DR. MAYS: What we try to do is have these kinds of user cases so we can make sure that we have some clarity of what level are we talking about and the ability for us to ascertain who is it that this fix or solution we have come up with is relevant to and who is it not relevant to. It’s almost like as soon as we don’t use that when we’re talking, this is how we get back to here, which is — I’m talking about community as a consumer. Somebody else is talking about community as the entrepreneur. That is part of why I think they were doing the introduction and saying this is the perspective from which I speak, because that is one of the problems.
I think we really do have to grapple with what we are going to do about this because it always comes up what community are we talking about. I think we have got to come back to what community are we fixing something for or designing something towards.
MR. DAVIS: I thought we had that defined. We should just have three little hats in front of everyone.
DR. MAYS: Bruce did it on his group. They had this general definition of what community is. For us, it involves a skill level so we can’t quite do it. His is a person thing. Ours has kind of a level of people’s ability to be able to use.
You wanted to comment, then Leslie and Paul and then Bill.
DR. COHEN: Damon, you are all absolutely right. I laid out a straw man case. Obviously, there is enormous variation in community capacities. I have seen some communities that use Bayesian probabilities to predict where the next neighborhood epidemic is going to be.
Ultimately, I guess for me, keeping the eyes on the prize — the prize for me is the federal government being more involved in improving the quality of community life. That is the goal here. If we don’t keep this straw man in our minds as the ultimate data user, or the ultimate group for whom we are doing all of this, then we might lose focus and only look at user types that focus on researchers or entrepreneurs who have specific needs and uses.
My role in this group is to bring us back to, in my view, what government should be doing. The federal government has pushed its open data initiative and we are all part of that and we all want to be part of it, and it hasn’t trickled down in a meaningful way to effect what ultimately I hope government wants to do, which is create better quality of life in our communities.
I am happy to say let’s keep that aside and get user types that might be intermediaries to help establish that, but we need to ultimately say it’s not about data; it’s about information, and it’s how information is put into this organic dynamic process that leads communities to do what communities do.
DR. ROSENTHAL: Just use personas. You can be the community person, Paul can be the researcher and I’ll be the entrepreneur.
DR. MAYS: I think what we want to think about is making sure that we are keeping the user case in front of us. But I think, to some extent, I want to make sure there’s a difference between what we expect NCVHS to do in terms of the full Committee versus what this group is expected to do. We could go out and do this at smaller levels, answer the data entrepreneur versus the others. So we can do that. The whole weight of what you just said I think is the role of the NCVHS full Committee but not necessarily the whole weight is on us.
Leslie, then Paul, then Bill.
DR. FRANCIS: This is probably a predictable comment about what Bruce said, which is even if intentional linkage isn’t the interest of communities, there are huge issues when datasets get combined in ways that permit potentially erroneous but damaging and stigmatizing inferences about individuals.
I’ll just give you one little example. The State of Utah publishes on its website, or did when I looked at it a year or two ago, abortion statistics by race and county. Now, there are some counties where there are remarkably small numbers of individuals of particular races in Utah, but you can look at that and realize that there are probably very few Asian-American or African-American women in your county, and you see an N of three in that category, and there was actually one data cell where there was an N of one.
I picked that example also because the desire to have those statistics be publicly available is politically motivated. There is actually some likelihood that people might look at those statistics and be interested in addressing anti-abortion propaganda or other sorts of efforts at individuals or clinics that might serve individuals from that community. So information like that, if you are not careful about ways data can get combined — also, when you are dealing with small numbers, there’s a huge potential for erroneous inferences. We are very aware of the background check erroneous attributions to people and the long-term damage that can result. So that’s a caution.
DR. TANG: This is in response to what Josh talked about in terms of survey data feedback and deriving some information about that, which I thought was a fascinating idea.
Many times people will say I just want to know X about X. I’m sure this has been tried, but what about a Siri for query — the voice recognition. If we can do Tableau, I almost think that means there is enough N in how we formulate our queries, even in natural language, that you could actually potentially answer some of these things, especially with access to fairly rich metadata. That is one question.
The other question is, even if we cannot do a Siri search — and I’m sort of guessing you can do the majority or more than 50 percent handled by a Siri — but even if you could not, you could potentially get all the information that Josh just asked for, those three things. I think listening to someone and the way they got in, I could impute their perceived usage or I could impute their community identification and then I could derive their perceived usage and essentially their free form tag.
I’m sure Josh and others may have the answer to this, but has Siri ever been tried for this? And maybe we could attempt to use Siri to actually provide the feedback, the adaptation on data use that Josh was describing.
DR. ROSENTHAL: There is a bunch of different ways to do that. It can be as simple as exposing a search query. You want to know what to look for? Just expose the queries blinded by top searches and then rank them by top, by most frequency. People reconstruct that with Google and they come up with all sorts of funny things, but there are widgets or scrapers that are literally designed to do that. That is what powers the keyword searches in analytics. You can absolutely do that.
There’s IVR natural language stuff you can do if you want to, but you don’t have to get that fancy. You can literally just use one of these modules. Or, if you don’t want to do dropdowns, you can just do tag clouds — basic tag clouds by suggestion, by most frequent, doing it on the size, and that’s a very good way to build a natural way of doing it.
The answer is yes, yes and yes; there is a bunch of ways to do that.
DR. TANG: Is that one of the paths we could pursue in terms of making the use of data accessible to more folks?
DR. MAYS: I think it’s one of the things we can think about suggesting. Remember, we are in the position of suggesting these things to people as kind of being a good data steward and trying to make your data more available to more people. What that requires, to some extent, is being able to have this background that’s running. Not everybody can do that. The fed should be able to do that for some, but even then, the security that’s built around some of the datasets I’m not even sure.
Again, let’s put it on the table as something to consider. I think our next iteration is going to be to pull up a draft and start filling this stuff out. Then I think we should look at the feasibility of it.
DR. ROSENTHAL: It doesn’t have to be through an HHS site. If I weren’t doing anything during the day, I would just stand up a free shell and link everything at .gov, and literally build this infrastructure easy-peasy with a link to it. Or consortium.
You don’t have to do this through the site itself; you can do it through a wrapper. By a wrapper I mean another entity that has links in with content out. Any 15-year old kid could come along and do it and I think it would be a really fun project. Given the Google indexing, you could take over the whole thing and it would be kind of funny.
But the real thing you should do is through a partnership. Like RWJ, really a consortium should be in the sweet spot for that.
DR. MAYS: That would be an idea. The other would be bring in Code for America and ask them to actually assign the interns to do this for a county or a neighborhood or something like that.
DR. TANG: What we need is a county or neighborhood of the 15-year olds that Josh described.
DR. MAYS: I am sure we can find one in the Commonwealth of Massachusetts. With Bruce and all his traverses around we can probably — if we need to do this as a use case we probably can.
DR. COHEN: It has got to be someplace warm.
DR. TANG: We will host it if you will bring it.
DR. ROSENTHAL: I don’t want to get too technically wonky, but it’s not as if you have to code the stuff from scratch. The stuff is literally out there. If you want to use WordPress, there’s plug-ins to do this. You don’t have to do this from scratch with a code-heavy background.
DR. MAYS: What we may think about is actually taking a case and asking a kid to do it, and putting that up as an example of what can be done, so that a community knows — go find a 15-year old and ask them to help you with this. We’re joking, but I think, Paul, this may be something to really think about as here is an example; let us show you what we did. I think that might be helpful.
DR. STEAD: I was just going to comment on the idea of a definition of community. We won’t have a definition. This is, again, a place where we are going to have to have a taxonomy.
When we are talking about the NCVHS full Committee effort to identify the federal role in enabling communities to be learning health systems, we’re talking about communities largely that have some sort of geographic relationship. We are equally interested in communities that share one or more determinants, so those are all about communities who we are trying to enable to improve their health.
As we worked through the framework taxonomy, it became very clear that there were other communities such as an organizational dimension, people that are aligned with one or another organization. There are also simply communities that share an interest. I think the latter gets to be closer to what we are talking about as the kind of communities we are trying to enable through communities of practices, where we are trying to help people do things. Those things may help other communities improve health.
I think we need to think about this as a taxonomy so we can point out — or apply metadata from the taxonomy — that this is the community we are talking about at this juncture.
DR. MAYS: Yes. I think that is good.
MR. SAVAGE: Bruce, from what you were saying, does that mean that if we build it for one user we have effectively built it for many other users; we just have to keep our eyes on the prize, so to speak?
DR. COHEN: Yes.
MR. SAVAGE: And which user is that?
DR. COHEN: It is the user who opens up their Google browser and asks the question tell me about heart disease in Roxbury. Then all of the data sources in our vast world will be searched to generate the data about heart disease — interventions that are known to work, community health clinics in Roxbury — that provides all of that information in a one-stop shopping simplistic kind of place.
That’s the dream, but that would be the ultimate goal, to somehow make the data easily accessible and interpretable to be used by people who want it, and not just the heart disease rate. It would be the diabetes rate, it would be obesity, it would be food deserts in Roxbury, it would be access to fresh vegetables. Anything that somehow is related to heart disease in Roxbury the magic of the web would be able to compile that and provide it to this person.
DR. MAYS: Any other comments before we leave this? Part of what we are going to do is to talk about bottom-up. This I think is one of the most important things for us to talk about at this juncture because it really is getting back to the community issues of how did those voices get in some kind of way institutionalized and giving us feedback.
One of the things is if Damon has to make priorities all the time, we don’t have endless resources. It’s easy for us as researchers because we are at a meeting and we’re going to catch you and fuss and say we want X and Y. But if we’re truly going to find out how to do this with a greater diversity of users, we need some kind of mechanism to do that.
When we were meeting by phone, I thought Paul had, as usual, some great ideas about thinking about this from the bottom as opposed to all the top-down stuff. So, Paul, I’m going to let you start.
DR. TANG: I am a clear bottom dweller, but I think actually this is a bit of a reprise from what Bruce just went over but maybe I’ll just say it in a little bit different way. I would summarize it by saying one of the best ways to make data accessible is to make it useful in an obvious way. Let me break that down.
Useful — and I don’t want to get into that whole discussion, but we can’t possibly understand whether it would be useful to someone without understanding the problem they have to solve for that community. If you take some non-health related thing like which car should I buy, well, you can’t answer that question without asking a bunch of questions if you want the problem solved.
Generally, if one community is older than the rest of America, the problem they have to solve could be loneliness. We have to understand the problem before we can figure out whether — and this was said before, too. You have to understand the purpose before you even declare whether it is useful or not. So we can’t possibly go around promoting data without understanding what are people’s problems to solve. So that’s the bottom-up perspective.
The obvious way, and I think Josh brought it up, is that we need to have exemplars. When we went over the key concepts, we have got to have exemplars of what are data and how would you visualize it to make it relevant to the decisions that you, community, have to make. That is what turns — you know, you can’t lead a horse to water. You can’t just be close to water; you have to make it in an obvious way, and I don’t think you can do that in any other way besides coming up with an exemplar of how do I visualize what data to help me make decisions that are more informed.
Then, to make data accessible, I think it’s the data concierge, the REC, the boots on the ground that we talked about. That’s how you make it truly accessible to other than the researchers who have made a lifelong career out of going to find these things.
Another comment about accessibility is what data is now accessible that wasn’t nearly 5 years ago, and with that I am referring to the whole electronification of clinical and person-generated data. The clinical EHR, the person generating it is what you might call the patient portal, but other ways that people do explicitly communicate about their health and things related to their health are now instrumented. How do we take advantage of that? That makes a different kind of data and potentially data that are far more relevant and far more useful than the data we have in our vast storehouses.
Finally on accessibility, it has got to be in the workflow. The workflow of a human being on this planet now is mobile. It means the workflow for the city council members who are the policy-makers in the community, for the PTA, for the Rotary — that is where people access their information.
And on the EHR, even as pretty as we could make data display, you can’t post it and affect one doctor’s life, as one example, and the people that he or she is going to see that day without putting it in the workflow of the EHR. In our vernacular, that would be the clinical decision support.
I think you have to really look at how does it affect the everyday life of people that we are trying to impact. Sort of what Jim Scanlon was saying, the researchers have been bred to go forage for this stuff and then figure it out once they come across one of these pieces of data. But for the everyday user that can impact so many of the rest of the 99.9 percent of it, that really has to be accessible in an obvious way and addressed by a problem to be useful.
I think that is another way of stating what I think Bruce so eloquently articulated just before this. So that is the bottom-up perspective, which I think is the most important part.
DR. MAYS: Yes. I might be a little biased as well, and that is kind of what I was saying. I think we naturally do the other levels and we are very clear about the top-down which is — okay; tell NCHS to do this. But part of being able to direct the improvement in terms of health and healthcare, we need to figure out more and more the ways in which we can hear from the everyday person.
We have to do it the way they are in their lives. They are not going to go sit at a computer and send us an email. Sometimes, when we are thinking about setting things up, the way you set something up to look at it on a computer versus setting it up to look at it on a phone is sometimes very different. There are people for whom their phones are going to be their primary source of both looking for, sending messages, participating. So we do have to kind of remember that there are generations who own a phone and not these big computers anymore. Not even tablets, to some extent. Instead, it’s the phone and the size of the screen.
What I want to do is kind of do just a little bit of a wrap-up for this section, take a short break and then let us move into the updates. When we finish, we can see how much time we have for the other parts of the Agenda. I think what we have done is pretty important, and it’s a really big chunk of work for us to move to the next part of it, so we may have to put some of the other work on hold anyway.
DR. COHEN: I wanted to pick up on a theme that really resonated with me that Paul brought up, and also that Josh did. It has to do with information rather than data and giving communities choices.
Fortunately, communities, like individuals, are not always rational decision-makers. They are driven by a variety of inputs that we as data suppliers cannot understand. Our role is to provide information and I think also to provide context and options.
For instance, in my heart disease example, the national prevention strategy, a variety of folks have done an assessment of strategies that work to reduce AMI deaths. We know there are lots of options to address a variety of issues in communities. If we can assess what the level of problems is, what we know works in terms of intervention and add the overview of the community preferences — whether they want to focus on problems related to the elderly or to youth or to violence or to chronic diseases — if we can marry the data with what we know works and with an assessment of community values, then we will have turned the data into information.
I think this technology and knowledge exist, so that is really the challenge when you’re thinking about a bottom-up strategy.
DR. MAYS: Great. I think we have a really big to-do list from today, but part of what I want to do is to back up and ask Damon, of the things we have been talking about, do you have priority areas of what would be useful to you in the short term and those things that we can work on in the long term? Do you have any short-term requests?
MR. DAVIS: In fairness, I should tell you we are going to be working on revitalizing healthdata.gov. We have a procurement on the street right now that is going to allow me to incorporate some of the things that you all are talking about here. So this is very valuable input. I want to underscore that I am not just sitting here in attendance and not taking away action. I am certainly thinking about how this can be applied to healthdata.gov and its new instantiation. So this is incredibly helpful.
One of the things I have said and I will say again is the community-building piece is going to be very much the next component of what it is that I’m trying to focus on. While I am paying close attention to all of these, I have been taking copious notes on some of the community-building pieces of the discussion that we have been having for purposes of hoping to develop a plan that is going to go to CTO Bryan Sivak so that we can at least start to talk through what is realistic, what is possible and what the opportunities and challenges are going to be out of that plan.
So, community-building both from a sort of structural, tactical thing like some of the ideas that Josh has put forth as well as, Bruce, your discussion of what communities are going to desire so that we can build out the usefulness of whatever the tactical or implementation is are going to be very complementary and, therefore, valuable. So that is one of the things that I came to this meeting with a specific focus on. If nothing else, that is one of my top priorities for right now.
DR. MAYS: Anything else? I think what we are going to do is take a break. Let’s see if we can keep it to 10 minutes. Then, if Damon and David will also come to the table we will have an update and then I can pull back after that and say what we will do next. Paul, will you come back in 10 minutes?
DR. TANG: I will. I do have a commitment at 4:00 o’clock.
DR. MAYS: Okay. Thanks, Paul.
[Break.]
Agenda Item: Update from HHS Idea Lab
DR. MAYS: What we are going to do now is what we usually open the meeting with. We are going to get an update from Damon, and if there are any recommendations that we made previously that you want to find out where things are, please feel free to ask him that. Afterwards, we are going to ask Damon to introduce David to us.
What I have asked David to do is share with us any comments about the things we have been discussing, and if we have time left he will do DDOD, which I will let him explain to you.
One of the things I think we may want to do is to have a longer session for the work group where we can actually share with you the URL and have you delve a little bit into it before we have a long discussion.
Damon, thank you.
MR. DAVIS: Thanks very much, Vickie. I have already scooped myself a little bit on some of the talking points that I was going to address today which is going to leave more minutes for David Portnoy to talk about demand-driven open data. I just want to give a couple of quick updates.
The first is the last thing that I said. We have a procurement on the street to revitalize healthdata.gov, and that has multiple different objectives. Not in any particular order, here are some of them.
The advancement of the platform is going to make for better usability in terms of finding and locating and interacting with the data. It is obviously going to have better back end administration. One of the challenges I have as a user of it myself from the backend workflow is, quite literally, just knowing the status of datasets in a very easy and functional way. So I am looking forward to better usability from the administrative side as well.
We are looking to build in some of the functionality that we have talked about here today — the ideas or concepts behind rating of datasets and dataset quality or usability or what have you. So building in some additional functionality beyond just having the platform be a catalog, but having it actually be more of a resource is going to be very key.
We want to make it flexible for future iterations of the catalog as well. It is very sort of static. It’s an amalgamation of multiple different software platforms and, therefore, slightly challenging to deal with. So we are looking for a refresh platform that is going to take advantage of current technologies, and we are pretty excited about the fact that that is coming down the line.
Another thing I wanted to flag and highlight for folks since I haven’t seen you guys for a long time is the fact that there is a lot of movement in the sort of data science, chief data officer realm, and I will point out a couple of things.
Niall Brennan was recently named as the chief data officer for the Centers for Medicare & Medicaid Services, so he has been instrumental in many of the open data efforts that have come out of CMS. I think you are starting to see more and more of those chief data types of positions across the Department as well as across the government at large.
An example of the larger higher level chief data scientist is the recent appointment of D.J. Patil. He used to be the data scientist at LinkedIn and he has now been appointed as the President’s chief data scientist for the White House.
What I think you are starting to see is recognition of the next phase of data utilization out of government resources. Not just the sort of Version 1.0 of locating and liberating the data, making it available, but actually the analytics of making it more usable. Where are the connections that we need to make between the data, and how can we start to take a more scientific and proactive approach to the ways that the data across government, not just at HHS, are utilized. So I just wanted to flag and signal for you that you are going to begin to see many, many more of those kinds of appointments and assignments to chief data type positions.
We are also seeing within the Department the need for more of those chief data types of individuals. We have all kinds of data resources that we have discussed at great length in this group and others, and I think we are really starting to itch and strive for the next level of value out of the data that we collect and curate. So it is kind of an exciting time from that regard.
Another piece that I would love to encourage the group towards is helping us figure out what it would look like to develop a data scorecard. What are the elements of the data that we are producing that you would like to see be part of a scorecard that would allow a user to quickly understand what this data has that is of value to them?
The usability scorecard could have elements of things like whether we have the classic sort of pdf versus API on a scale of “liquidity”, quote, unquote. There is the idea that the metadata quality could potentially be scored, and there are multiple elements that we could consider here that might roll up into an overall comprehensive data scorecard.
The reason I think that is a valuable exercise is that it would allow me and others like me across the Department to turn to our colleagues who are the owners of the data and say, here is something that we really need you to work on, and here is the score that this dataset or this data resource gets, and here is why we think it would be valuable to spend any level of resources on making this any better. It can come off as somewhat subjective for me or someone like me to simply turn to one of my colleagues and say I’d like for you to make this better or turn it into an API. To have something that folks are able to actually point to, a tangible asset that says here is the scorecard that either the community or someone else has basically assigned to our data resources could be incredibly valuable as well.
DR. MAYS: Can you describe it in terms of is it that your office wants to develop it? Is it that you want us as a convener to help get it developed? If you could say more very specifically how it is going to get developed and then how it gets used I think that would be helpful.
MR. DAVIS: All good questions that I don’t necessarily have answers to. I have already given you a use case for how I would use it, which is as a way to turn back into the Department to say here are some opportunities for us to make our data better. It could be along the lines of — to advance to David’s presentation — if we have this demand-driven open data that basically is giving us a signal as to what people are desiring, we now know what we want to focus on. We have targets. But if those targets have low scorecard values, then there is an opportunity for us to actually make those data better for those who have decided that they want to use this data.
If we have demand for something and we also realize that we need to make it better, the scorecard is going to help us understand where it is that we can allocate the resources to make it better there.
Who would develop it? I don’t know. That could be a function of you all providing the framework for what could be valuable in that space and then say the Health Data Consortium potentially, as an example, could be an organization that might set up a framework, set up the actual implementation of it. Totally off the cuff, and I am not by any means suggesting that the HDC is already involved in that. But that’s one scenario for its implementation potentially.
DR. MAYS: Do you need a recommendation to have a data scorecard, or is this something that the Department is saying that they are very interested in but it is just in a vague place as to who and how? I’m just trying to get a clear sense of what our role could be, should be or isn’t going to be.
MR. DAVIS: I would turn this around back to the group. Does the group feel like that would be a valuable thing to have associated with any dataset? Kenyon put up eight to ten different components. That basically almost looked a little bit like a scorecard to me, right there. If that would be valuable for the data-using community to have — whatever, stars, numeric ratings one to ten, whatever — if that presentation was associated as a scorecard for the datasets, would that be something that data users would find valuable or not.
I do not want to create work; I want to create something that is going to be valuable for those that we are attracting. But if we are talking about reaching very intelligent and savvy researchers and the lay public to these data in order to help them satisfy what it is they are trying to accomplish, I wonder if in fact something like this could be of benefit.
DR. MAYS: Paul, make sure and let me know if you want to comment. We are going to start with Bruce.
DR. COHEN: I am really ambivalent in the truest sense of the word about scorecards and grading. This arises out of local public health experience. New York City rated the quality of its restaurants and it essentially created ten times more work for the public health department to explain and adjudicate anybody who did not get an A. My local community decided it wanted to post all the information about our inspections — the same thing New York did — but not attach letter grades. People don’t use it as much but they have access to the information.
I guess I am worried that we don’t create more incredible work for the data collectors and data holders if we create some kind of qualitative evaluation that they might disagree with. Providing characteristics about the data simply to guide people’s understanding of the use I think would be a wonderful addition, but providing a scorecard or grades, again, can be problematic.
Some people love it. I use Yelp and Trip Advisor and it makes sense to me, and, at the same time, these other issues.
MR. DAVIS: The funny thing is an A for one community could be an F for another. My Mom might want to use a query tool, in which case she has an A in terms of the usability of that dataset. You, as someone who is actually looking to grind the data and analyze it with multiple other datasets, are going to find that data resource as a query tool to be an F. So we do have to think through what it means to assign quantitative values to stuff. However, I think it will also be valuable in terms of turning to those who are the data owners to say this is an opportunity for us to make this that much more usable, and here are the reasons why.
There’s a tight rope to be walked on that one.
DR. SUAREZ: To me, the valuable part would be to offer the ability for data users to create their own scorecard. In other words, I want to see if this data help me address the need that I have and I want to obtain — we talk about metadata — out of the metadata information of the database and other information, I want to sort of score the dataset to see if it meets my need. So it’s more of a dynamic. I wouldn’t necessarily call it scorecard, a dynamic evaluation of the usefulness of the data for the purpose I am looking. Yes, as you said, what someone believes is a good dataset for what I need to do someone else could say, well, it doesn’t help me do what I need.
In my mind, it is much more valuable to create models that could help people evaluate the datasets — call it scorecards or generating evaluation information. It’s not so much the importance of the name, but it’s more the ability for the users of the data to be able to more systematically evaluate the value of the dataset for their purpose.
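[Editor's illustration, not part of the proceedings: a minimal sketch of the user-defined scoring idea discussed above, in which the same data resource earns a different score depending on what a given user cares about. The dimension names, ratings, and weights are hypothetical and were not proposed by any speaker.]

```python
# Illustrative only: dimension names, ratings, and weights are hypothetical.

def weighted_score(ratings, weights):
    """Return the weighted average (0-10) of a dataset's ratings for one user.

    ratings: dimension -> rating on a 0-10 scale (missing dimensions count as 0).
    weights: dimension -> how much this particular user cares about it.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    weighted_sum = sum(w * ratings.get(dim, 0.0) for dim, w in weights.items())
    return weighted_sum / total_weight


# One hypothetical data resource that is offered only as a web query tool.
query_tool_ratings = {
    "query_tool": 9.0,
    "bulk_download": 1.0,
    "api_access": 0.0,
    "metadata_quality": 6.0,
}

# Two hypothetical users with different needs.
casual_user = {"query_tool": 1.0}
analyst = {"bulk_download": 2.0, "api_access": 2.0, "metadata_quality": 1.0}

casual_score = weighted_score(query_tool_ratings, casual_user)
analyst_score = weighted_score(query_tool_ratings, analyst)
print(casual_score, analyst_score)
```

[Under these assumed numbers the casual user's score is high and the analyst's is low for the same resource, which is why a single published grade may mislead one audience or the other.]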
MR. KOSHAL: Damon, I have a question for you and this is something I have been struggling with for a bit. We can’t be everything to everyone.
MR. DAVIS: I totally agree. I am so glad you said that.
MR. KOSHAL: So how are you and the agency and Brian and all the great people we used to work with thinking about prioritization and the trade-offs that you have to make?
MR. DAVIS: That actually goes back to the point about community-building. Brian, Greg Downing and myself had a couple of conversations where we basically said if we didn’t liberate another dataset and we focused on driving traffic and community towards the utilization of the data we do have out there, and in recognition of the fact that we can’t be all things to all people, what would be some of the initial communities that we would focus on in terms of supporting their data needs? The Health Datapalooza was stood up out of an interaction with the developer and entrepreneurial community, so arguably that would be one of the first and highest priority community types to focus on.
We have, for example, begun to see more and more data visualizations out of the Department, so do we need to focus then on sort of the viz community for the usability and sort of understandability of the data? I fully recognize that I am not going to be able to support every single American, like the guy on the street, right now, but I do think we can focus on, again, some of the higher level community types that we talked about earlier.
We may need to start with a few, figure out what we think the value proposition is for supporting them juxtaposed with the level of effort it would take to support those, and I think whatever falls in the sweet spot of highest impact with lowest level of effort, to the extent that we can make that cross-cutting analysis, that is going to be where we would need to start. To the extent that we can feed in other communities at a later date, we would do that.
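[Editor's illustration, not part of the proceedings: one simple way to express the "highest impact with lowest level of effort" comparison is to rank candidate communities by impact per unit of effort. The community names and the 1-10 impact and effort values below are hypothetical placeholders, not Department estimates.]

```python
# Illustrative only: names and impact/effort values are hypothetical.

def rank_by_impact_per_effort(candidates):
    """Return candidates sorted from highest to lowest impact-to-effort ratio."""
    for c in candidates:
        if c["effort"] <= 0:
            raise ValueError(f"effort must be positive for {c['name']}")
    return sorted(candidates, key=lambda c: c["impact"] / c["effort"], reverse=True)


communities = [
    {"name": "general_public", "impact": 9, "effort": 9},
    {"name": "developers_entrepreneurs", "impact": 8, "effort": 2},
    {"name": "market_failure_areas", "impact": 10, "effort": 8},
    {"name": "data_visualization", "impact": 6, "effort": 3},
]

ranked_names = [c["name"] for c in rank_by_impact_per_effort(communities)]
print(ranked_names)
```

[A pure ratio ranks high-effort, high-impact work near the bottom, so a ranking like this is at most a starting point for discussion rather than a decision rule.]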
I think that is probably what you are going to see the approach look like, is let’s focus on these two or three top sets of communities and then we will figure out how best to serve those folks first, and then we will build out any other engagement that we think we can. And that I think is going to also come very much from the DDOD that David is going to talk about in a few minutes.
MR. KOSHAL: I’m a big believer in the multiplication effect, and I think Datapalooza has done a wonderful job but it is also where the market is becoming efficient. So there are business models; therefore, entrepreneurs and technologists are coming in, but there are lots of different areas where there is still market failure.
Maybe one suggestion would be if we could prioritize that part of the world as well. I know we can’t go everywhere in terms of where the markets fail, but there must be some big impact areas with different communities where you can still have an impact. The data form they need might be a little different. It might require more work, but again, if there’s the multiplication effect and it’s focusing on problems that the entrepreneurs have I still think it’s worthwhile.
MR. DAVIS: Yes. That is an interesting point. The failure area may have a very large level of effort but it could be an exponential benefit. That’s why I was saying that to the extent that you can formulate this cross-section of a very easy level of effort to potential for impact, there are going to be some things that will fall on the outside of that level of effort but with exponential impacts. We are going to need to figure out how we can get across the finish line with those elements as well. It is no small challenge, to be sure.
I would welcome the input of everybody around the table to suggest the various communities that you think would be instrumental in supporting towards the learning healthcare system and improvements in healthcare and health quality and equity and things along those lines. Don’t be shy about making your suggestions for whomever you think we should prioritize among our tops because we’re taking input from wherever we can, and we also welcome your input in evaluating the process, too.
DR. MAYS: I think part of that is if you take the multiplier example, you actually have these communities that HHS is doing that with within NIH and within very particular agencies. There is the CTSI, and they are spending a lot of time and money within NIH to have a community that is very informed and wants to use data. And they are being paid to bring them together. The training of that community is already being done. I can’t remember how many millions but it’s almost like each CTSI is getting about $20 million and it has a huge community component to it.
Also, we saw this at Bruce’s workshop. SAMHSA has a group of community people and that is what they are interested in. HRSA has a group of community people. These are the people that their bump-up isn’t as great, I think. These agencies are already there, they already have them at the table, so they are probably prime to sit and do data stuff with. And then they have an infrastructure to go back to. They don’t even need to come back to you, but CTSI would take care of them, or HRSA or SAMHSA.
MR. DAVIS: Yes. I think what you are alluding to is the fact that we need to take advantage of the existing efforts, wherever they may be. Some of them may need small support to grow; some of them there will be places where there is none, as Mo has alluded to. That is certainly going to be part of the process, just sort of continual conversation across the Department as to what community engagement looks like — with the loose use of the word community — to try to build it out and make it more valuable such that data really does become a resource for multiple different user types across HHS’ various endeavors.
DR. FRANCIS: I just have a very simple point, which is that a feedback tool can have at least two different goals or audiences. One could be the Department, to improve on how they are doing. The other could be other potential users or the general public in some way of the dataset. I was thinking of the analogy to course evaluations. One is telling faculty what they need to do to improve their teaching, and the other is student popularity polls. The two are not entirely aligned, although they are relevantly aligned in important respects.
As you think about developing feedback tools, it really is important to try to figure out whether you want it to be information that you guys use — maybe public, maybe not. I’m always in favor of it being public — really asking for quite different kinds of information than popularity contest.
MR. DAVIS: Thank you.
I guess the final thing that I’ll say is that we are in the planning phases for Health Datapalooza, so one of the things that is always supportive of the Department’s data agenda is the use cases, the stories of how the data have been utilized. So I would very much appreciate from any one of you all some of the creative utilization of data stories that will be instrumental in making the palooza informative to multiple audiences.
I run a session there called the Data Lab where I try to marry up the data owner at HHS with an actual user of their data to tell the longitudinal story of why the data are collected and curated and what format you get them in now, and then turn to that external user to say, and here is how I’m using it. So I would love any inputs that you guys can provide into some people who are doing really interesting, creative, moving-the-needle kinds of things with the data. I would love to feature them in the Data Lab with the data owner for the dataset or resources that they are utilizing.
That concludes what I was going to say. I am going to turn attention now to David Portnoy. I have mentioned him here several different times. David is an entrepreneur in residence from Chicago that we, in the HHS Idea Lab, have invited in to focus on what we call our database linkages entrepreneur in residence project.
Just as background for everybody, I don’t know if you realize what the Idea Lab is focusing on so I will just say a few quick words about it.
The Idea Lab is looking for the innovative pathways, the creative methods through which we can deliver the services of government better, more efficiently and effectively. One of the pathways for that is entrepreneurs in residence where we invite folks from the outside to bring in their expertise to help us solve some problems we have identified internally. David is an example of one of those external entrepreneurs that has become a Departmental employee in order to help solve some problems.
The problem that David has identified is our inability to sort of prioritize and gain meaningful feedback from multiple users of our data about what they think we should be liberating and how.
Without stealing any further thunder from David, I am going to turn it over to him to talk a little bit about demand-driven open data and database linkages across HHS.
MR. PORTNOY: Thank you very much for that introduction. I think that covers 80 percent of what I was going to say. I really appreciate being invited here. It is a great honor. This is a whole new world for me, so I appreciate it.
Vickie has asked me to compare and contrast what I have been doing with the great work this committee has been doing, and I think the best way to sum it up is you guys are playing chess and I’m playing tic-tac-toe. You are at a much higher level. You are thinking about this strategically, long-term, and you are looking for overall recommendations that could be implemented over many years.
Right before coming to the entrepreneurs program I built a company called Symbiosis Health. I was actually a firsthand beneficiary of using a lot of HHS datasets, and my problem was that I wished I could have just told somebody the way that I need to use the data. It wasn’t that the data wasn’t available; it was that we needed minor improvements.
It may be something as simple as providing the datasets on a more frequent basis, or having somebody explain or provide the data dictionary to us, or just improving the data quality or adding a field that we can match across different datasets. These are things that for HHS, in theory, would be very low cost but for industry would be huge. You can bet that any time there is an organization or researcher who needs something very specific, there are probably dozens more like that who you don’t know about.
So that is where the idea of demand-driven open data came from. It is kind of predicated on the assumption that the most interesting kind of uses and linkages that are happening with the datasets currently — basically, the things that are leveraging the existing assets of HHS — are happening by industry, by researchers and by citizen activists. The goal is to give them a voice and help them do the things that they already do better.
The way I describe it is that it is a systematic, ongoing and transparent mechanism to tell HHS what data you need. Every project, every open data initiative would be managed in the form of a use case. In essence, you have a customer before you build the data, before you spend the money. That gives you the ability to prioritize and it gives you the built-in feedback loop and, most importantly, you know when you are done. It is that customer that tells HHS at the end that yes, indeed, you have accomplished what I needed and you are providing me value.
I think I will stop there. My goal today is just to introduce some food for thought and perhaps stimulate some discussion without taking too much time. Maybe next time we can talk about — there is a website up, and as Damon mentioned, if you are familiar with specific use cases it would be great to get them onboard and just work through them. It is demand-driven-open-data with dashes between every word.
DR. MAYS: I have it. I will make sure I send it to the group. One of the things I think we will do — I did go to the URL and I would like to make sure that on one of our conference calls we actually have him talk about it in more detail, and we can ask some very specific questions. My sense of it is, if you look through the website you get a totally different idea than what you just think about from the title.
Let me ask Paul — before you go, do you have any questions?
DR. TANG: No. This is a very interesting approach and it actually is a bottom-up perspective.
DR. COHEN: Again, very exciting, thanks. I look forward to working with you, David. I have a question and a comment.
The comment is that there is no central complaint department. Who owns the data? If you have problems using the data or an issue, I don’t know where to go. You can try the public information office in a particular agency and you can end up with the programmer who loads the data. It would be wonderful if there was a system of triage for questions like you are asking in a more efficient way throughout HHS. If you could figure out how to do that you will have gone a long way.
The other curiosity, and if you want to wait until some other time to have the discussion I would be very interested to hear what kind of linkages you were doing. Were you doing individual level linkages using secondary identifiers in a probabilistic manner, or did you have direct kinds of link variables, and what datasets were you linking? It’s more of your past than your future but that would be very informative as well.
MR. PORTNOY: It is a good question about who is the person doing the central coordinating. The concept is that there is basically the equivalent of a community manager or a DDOD coordinator. Work doesn’t actually get done there. The actual work gets done at the level of the program owner or the data owner, which could be deep within the bowels anywhere within HHS.
It is up to the coordinator to basically intake the requests, which the site explains how to submit, and make sure that they are being handled in a prompt manner. So they are facilitating. What is not clear yet is how well this can scale. Right now, I am the person doing it and I am already inundated. I have 17 use cases and six customers, and there is not enough time in the day to get to everything. There’s a lot of tweaking to do.
The other part is that we need to actually have promotional campaigns to get the word out. It is good to have something, but if people don’t come and people don’t know to use it, it doesn’t do any good. So we are out there kind of pounding the drums. At Datapalooza we are going to have a panel that is specifically for DDOD, and we are getting the data users. We are going to go through the use cases and we are going to get people really inspired and hopefully that will mushroom.
MR. SOONTHORNSIMA: Ob Soonthornsima, NCVHS Member and Member of the Subcommittee on Standards. Damon and Vickie and I kind of talked a little bit earlier in terms of the priorities, and now that you have introduced yourself I see all of these things coming together. So let’s start with it’s more like a demand management and triaging or concierge, pointing people to where you are going to get the information to answer, or the data ultimately to answer those questions or problems. So that is kind of like at the top.
What we talked about earlier today, a little bit lower in the layer where you actually have a governance process — and this is more of a question. Is there a data governance structure, formal or informal? I imagine there is a lot of informal within the agencies themselves, or formal within the agency but less formal across the Department. That was a point made earlier. You have different groups of users within this particular agency and a common external user, but when you try to span across the enterprise of the department, that is where the bigger issue lies, especially when you are trying to come from outside, looking in. So which agency do I go to? That was the first comment.
Maybe that is a potential priority for you, to kind of look at what does our data governance infrastructure look like, the process, or even a stewardship program in the master data management concept. I don’t mean this has to be done overnight because this does take that long. Even in a small corporation it takes 5 to 10 years to actually stand up a governance program. That is going to be an evolution.
That is when we got into another layer below that governance process, which is really this concept of metadata that you talked about earlier. The governance, as you know, you have who the users are and who the data owners or stewards are, from the IT, from the business standpoint, the use cases or sources and so forth. Those are some of the elements that you actually listed on your slides earlier.
But now, the metadata — and again, we are not talking about deep in the bowels of each data warehouse or data sources. Really we’re talking about a sliver, something that can be discoverable. So the concierge at the end of the day can at least triage and point to the right place. Am I making sense here? That’s what we were talking about, that level, and it sounds tactical but unfortunately, the liberating thing we have been trying to address, this vision, it does require that type of more of a structured approach.
That is my only comment, and some of these might have to be baked into your short- and long-term priorities.
MR. DAVIS: I think to the governance point, it is very much right now — I recently heard HHS described as the Humphrey Building is sort of a holding company almost for a lot of franchises that have autonomy to do a lot within their own budgets, their own strategic objectives and things along those lines. Judging by the chuckles around the room, I suspect that that is fairly accurate.
I bring that up because I think you find the data governance is very much federated down in the organizations that are the true data owners with the SMEs inside them, in which case, data governance across the entire entity is a higher level coordination than it is at the significantly more granular and tactical level for any one agency. So there are certainly coordinating things that happen at the agency level.
Jim chairs the Health Data Council, which does a lot of work to sort of coordinate data collection survey mechanisms and things along those lines. I run the Health Data Leads Program which is folks who are looking at the tail end predominantly of data release practices, and where are we on privacy and security, quality, attestation, things along those lines, and just generally where is the data and what can we make available. So there are varied levels of engagement in the governance process.
Another thing I wanted to hit on was something that Bruce brought up earlier, which is how does an individual get to the point where they can know where to go to get the answers they need. An interesting thing has happened. For those of you who don’t remember, data.gov federates and ingests datasets from all of our federal entities. So healthdata.gov data are shown on data.gov.
Many, many people are going to data.gov, finding healthcare datasets that really are from healthdata.gov and they are asking their questions there. They, at data.gov, have begun to forward to me and Greg some of the email questions they are receiving about their healthcare data, which is great because we then can turn to the health data leads and say, hey, AHRQ, we have this specific question about data. And it has been everything from, hey, you’ve got years 2009 through 2012 but I don’t see 2013 and here we are in 2015. So there is that, where is the data, and then there are more sort of technical questions that we are starting to be able to farm out the answers for, which is kind of exciting because you’re starting to get that broader level agency engagement.
Obviously, it could be a huge challenge if we get mass volumes of data, but it then speaks to the desire to build communities that can support each other in answering questions for various use cases. So it is an interesting time to see how the entirety of engagement with the public is going to change based on the community development of demand-driven open data, the ability to traverse the Department in answering questions and things like that.
DR. COHEN: I think that is great, and one thing that we instituted at a much smaller level was when people contributed data to the federated data hub, they needed to identify two people, one who could answer substantive questions about the contents and another who could answer technical questions about data availability and format. That would really streamline the process.
What you might also want to say is when you get asked questions, copy it to you so you can keep track of what those questions are and what the responses are, but it shortens the feedback loop in a way that makes it a lot more efficient. So, part of the requirement to be part of healthdata.gov would be identifying these two key people. And every year send out an email blast to make sure their phone numbers and email addresses are updated.
MR. PORTNOY: This is all great. Policy and governance are obviously very important. A lot of the work this committee is doing probably will help inform that.
In the short run, what I wanted to point out is we are being very simplistic here, and it is important to remember that all of the decisions are ultimately made very close to where the data resides, at the level of the program owner and the data owner. We are there to provide them with vital information about what people need, so we are just facilitating.
It is important for them to get information but there are other things they have to consider. For example, what is the relative cost of releasing that? What is the sensitivity of the data, what is their risk of re-identification? How closely does the request fit with the agency’s strategic goals or the organization’s priorities for that year?
There are a lot of things that need to be taken into account. We don’t want to dictate. We don’t want to do this through an incentive. Nobody would listen to me anyway. But we want this to be a mutually beneficial relationship, and the most important part I think is that everything we are doing is completely transparent and documented and available on the website.
To your point, Bruce, every time we find out information, there is also a solutions section. We are building the knowledge base. If you go there now you will see the beginnings of that.
One really great example of this involves the NPPES and PECOS programs; I don’t know if you guys are familiar with them. In one of the use cases, the requesters, try as they might, could not get any answers from CMS as to what a third party’s rights were to edit data for physicians. This is one of those cases where the actual work is really easy. All we have to do is help CMS provide the information and we are done. Then the use case is finished and there is a lot of value to this company.
DR. MAYS: Thank you. I think what we are going to do is actually follow up with you when we have a conference call, because once people have had a chance to actually look at the URL and get more information I have a feeling there are going to be more questions. I think you are really going to like it and have questions. Thank you, Josh, for making sure that I understood what he did. Thank you for being here.
What I want to do in the time we have left is a couple of things. I want to have a very short time to talk about operationally what are the next steps of what we need to do. Part of that was based on I wasn’t sure what was what. I was hoping we had a staff person here today for us but that didn’t work out, so we still have to talk about that. That is still on the table. Damon is already putting in some extra time on this. I want to do less of that, but we still do need to talk about things going forward just in case we do have resources.
The other thing I want to do is to say part of what we talked about on our call was that this whole framework — no, not framework — this whole guidance comes under the auspices of thinking about data stewardship. Before, we had data stewardship last in this and we were looking at it like — well, tell us what we should be concerned about. Tell us if there are any precautions.
Instead, what we are doing is we are really going to flip this around and kind of position this as this is data stewardship, and what we are trying to do is to help the various HHS entities have a sense of what comes under that, have a sense of the good they can do, and have a sense of maybe their responsibilities as well in terms of making their data more accessible and usable. So it is going to be a different frame.
Leslie, one of the things I wanted to take a moment to do is, given what you have heard today, though you have been commenting, are there any things in particular that we need to start thinking about and worrying about?
One thing that I think we are going to have to have a long discussion about is I hear your worries about linkages, but the problem we have is that the nature of the world right now is all these mash-ups, all these things that people are just — they really do. Again, they will take an HHS dataset and then the community — it’s like throwing spaghetti on the wall and let’s see what we get. Again, I think it was really good for me to go and I had to realize that.
What I wanted and what I thought were the better datasets was not what the community thought. It turns out the community had a really different interest. Mine was solving the problem and coming up with a solution at a policy level, and theirs was much narrower. They wanted to know, in that community, what police should not be given any overtime — it would go on and on. I was like no; we want to describe the problem.
We cannot prevent what I think you are worried about because the HHS datasets will be used by community in the ways they see their problem. The only thing we can do is to hear better what their problems are and then see if, from a level of sound approaches in terms of data, we can make those linkages be able to answer some of those more social driving issues.
DR. FRANCIS: Maybe the way to put this is your comment was we cannot prevent. Well, sure. But maybe that is not our business. I learned a lot from how you were thinking about stewardship because the stewardship toolkit was, in many ways, influenced by fair information practices. The point you made a moment ago is that it is as important when you collect data to make sure it is used, usable and available to the people that need it. What our stewardship framework was largely about from a fair information practices perspective was making sure that is done responsibly — not addressing things like how do you push data out in a way that the community can understand.
With that said as the basis, here are some things to think about that may go back to if not individual linkages, two of the really serious questions that are raised in the stewardship framework that maybe haven’t quite come up in the same way. Openness and transparency came up. If you want a core fair information practice principle it is that there should not be any surprises. The worst thing for all of you would be to have people suddenly be shocked by how their data are being used.
The question that I think should be very much at the forefront is data.gov and healthdata.gov are calling for the release of data. Let’s be sure, though, that as data get released, people understand that that is happening and what kind of data are getting released. That doesn’t need to be everybody, but experts who care. So that is one thing.
A second thing is purpose specification is another fair information practice, or principle, that I don’t think we have talked a lot about today. A concern that people have is that if data were collected for one purpose and then they get used for a radically different purpose that might be thought objectionable, how do you get a handle on that. On the one hand, when individual health information is collected in medical records — let’s say of the kind CMS might have, claims data anyway — that information was collected so that people could get their healthcare paid for.
Now, it is not too far from that to people having the information used to improve the quality or the cost of care. But it is a big step, in fact, a kind of questionable step, maybe more than questionable if that data were used to figure out who is super expensive for the purpose of discouraging them from enrolling. If Medicare Advantage plans were to use claims data to figure out you don’t want to enroll people in this demographic, in this zip code, that has not got anything to do with improving people’s health. It has to do with improving the bottom line of Medicare Advantage plans.
Another place that kind of point gets made is with respect to I think many people believe that it is just fine to use this kind of data for purposes of figuring out side effects from drugs. But maybe it is not so fine to use certain kinds of data as a way of increasing the profits of drug companies. How you sort that out I don’t know, but I’m just giving you two examples where there are a lot of critics of re-purposing of data.
Whether it is data use agreements, whether it is attempting to continue to have your finger on the pulse of how the data are being used, whether it is at times questions about what data are with what granularity to release — those are all I think appropriate stewardship questions.
Bruce and I talked afterwards and I think I made this point before, but there is a distinction between attempting to link individual records and getting small enough Ns, or Ns where the frequency is high enough, so that inferences might be made about individuals. There are some risks attached to that that are not the same as the risks of re-identification exactly but might not be far off.
Let’s say we knew that people in a particular zip code had particularly expensive healthcare needs, and then businesses deciding whether to locate in the community or whether to hire people from that zip code use that data — well, maybe that’s just fine, but maybe it’s also something we ought to at least try to figure out whether it is going on.
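The small-N and high-frequency concern described above can be illustrated with a simple pre-release check on grouped data. This is a minimal sketch and not a method described by any speaker: the function name `flag_risky_cells`, the minimum cell size of 11, and the 90 percent share cutoff are all assumptions chosen for illustration.

```python
from collections import Counter

def flag_risky_cells(records, group_key, attr_key, min_n=11, max_share=0.9):
    """Return {group: reason} for groups that could support inferences
    about individuals. Thresholds are illustrative assumptions."""
    totals = Counter(r[group_key] for r in records)
    by_attr = Counter((r[group_key], r[attr_key]) for r in records)
    flagged = {}
    for group, n in totals.items():
        if n < min_n:
            # Too few people in the cell: individuals may stand out.
            flagged[group] = "small_n"
            continue
        # Largest count for any single attribute value within this group.
        top = max(c for (g, _), c in by_attr.items() if g == group)
        if top / n >= max_share:
            # Nearly everyone shares one value: group membership implies it.
            flagged[group] = "high_frequency"
    return flagged
```

Called with a list of dictionaries, a grouping field such as a zip code, and an attribute field such as a cost band, it returns only the groups that trip one of the two checks. A flagged cell is a prompt for review, not proof of harm.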
DR. MAYS: So things like this we would kind of have a warning about and then, as we monitor these processes, what you’re doing is increasingly warning people?
DR. FRANCIS: One thought is let’s just start with transparency, not only about what is being released, but what are people doing with it.
DR. MAYS: We wouldn’t know that. We would actually put it out and say what purpose we hope it serves, but we won’t know that somebody decides to put a CDC suicide dataset with something else and something else and then what —
DR. FRANCIS: We might not, but we could.
DR. MAYS: We won’t know until after it happens, though. How could we know?
DR. FRANCIS: That is why I started out by saying I’m not sure prevention is the way to start with the question.
DR. MAYS: These are really good.
DR. FRANCIS: Maya may have something she wants to add to this, but I’m just trying to raise what some of the questions are that could be raised from the stewardship perspective that I think, at a minimum, we want to keep our awareness of so that what does not happen is a rush to release, mash, play, use that backfires.
MS. BERNSTEIN: I came in in the middle so I am not quite sure where — I heard what Leslie has been talking about but I’m not sure where I came in.
When you are disclosing data outside of the Department?
DR. MAYS: What we have been talking about is kind of the whole work group issue of trying to facilitate HHS’ ability to put data out there. One of the things I said is I don’t think we can stop some of the inappropriateness that will happen when people put datasets together that we had no idea they were trying to put together. I think that is what Leslie was asking.
DR. FRANCIS: What I was doing, though, was sort of turning the question back. The obvious way to stop it is you don’t release the data. But if the answer is we are not going to do that, then the question is —
MS. BERNSTEIN: It depends on what the data is and what granularity you have in the data and so forth. We do review the data before we disclose it. We try to get a sense of what is out there and how it might be reasonably combined so that we don’t do things that are going to get us in trouble that we know for sure.
As we release more data, more and more possibilities are out there. Every now and again, you have a case where — you have the NIH case where we actually pulled back data because we figured out that that particular data should not be out there and we could do things with it that were not anticipated and that are not good for the community.
I don’t think you can ever anticipate all the possible uses that somebody can put to data in the future. You sort of have to do your best guess before you disclose data about what you know about what is out in the public domain already and how people might be able to combine datasets as a stewardship thing, before you disclose it. But I am not sure we are always going to get that right and eventually people are going to put things together.
We can encourage that; we sort of want to encourage that because innovation comes from that. We want to think about it carefully before we make those disclosures.
DR. FRANCIS: We want to do that, and we want to have possibly some way — I think this is real brainstorming — some way of trying to figure out when there is something problematic, when there has been a shift in the landscape or something that we did not anticipate so that, like the NIH example, perhaps something can be done.
MS. BERNSTEIN: We can rethink how we are going to continue to disclose data in the future. Yes.
DR. MAYS: Okay. We have 10 minutes left and I want to get everybody in here. Let’s start with Bill.
DR. STEAD: Let me pick up from Leslie’s comment about the importance of the purpose of the use. That is just central. We have worked with patients with everything from whether they want us to collect genetic information — it depends on the purpose. If we are doing it because we think there’s a chance it may affect how we are going to prescribe drugs, they want us to collect it. If we are doing it because we are looking for a diagnosis that none of us know what to do with, they don’t want us to collect it.
We need to know the purpose for which it was collected. We need to know the purpose of the person who is pulling it off the site. I understand it is hard. I am not saying it’s easy. And then we need to know whatever provenance is available about the conditions under which the patient, or whoever allowed the data to be collected in the first place, whatever was attached at that point. And to the degree that was segmented, we need to understand that.
If we start — I play back to the comments about folksonomy as opposed to taxonomy. If we don’t have a taxonomy, even a high-level taxonomy, we start by simply asking those questions and collecting them as free fields. Then as we begin to understand their key differences in purpose — you have mentioned a couple; there are several we already know — those could in fact become high level taxonomies that we all understand and the folksonomy could sit under.
So I at least think this is doable but you have to have those four components if, in fact, you want to be able to handle the combination of situations that will evolve over time.
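The four components described above (the purpose of collection, the purpose of the person pulling the data, the conditions attached at collection, and any segmentation) could be captured as free-text fields and later rolled up into a high-level taxonomy. The following is a minimal sketch under that reading. The class name `UseRecord`, the `TAXONOMY` labels, and the keyword lists are hypothetical and do not come from the discussion.

```python
from dataclasses import dataclass

# Hypothetical high-level taxonomy; labels and keywords are illustrative only.
TAXONOMY = {
    "payment": ["claim", "billing", "payment"],
    "quality_improvement": ["quality", "cost of care"],
    "drug_safety": ["side effect", "adverse"],
}

@dataclass
class UseRecord:
    collection_purpose: str       # why the data were collected (free text)
    requester_purpose: str        # why this user is pulling the data (free text)
    consent_conditions: str = ""  # conditions attached at collection, if known
    segmentation: str = ""        # any segmentation applied, if known

def classify(free_text):
    """Map a free-text purpose to a taxonomy label, or 'unclassified'."""
    text = free_text.lower()
    for label, keywords in TAXONOMY.items():
        if any(k in text for k in keywords):
            return label
    return "unclassified"

def purpose_shift(record):
    """True when the stated use falls in a different category than collection."""
    return classify(record.collection_purpose) != classify(record.requester_purpose)
```

Free-text entries that land in "unclassified" are the folksonomy layer; recurring ones would be candidates for new taxonomy labels. Note that two unclassified purposes compare as equal here, so this sketch would not flag a shift between them.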
DR. MAYS: Obviously this is going to be a longer discussion because I think very important issues have been raised. Walter.
DR. SUAREZ: I think my point was going to be more about the distinction, because I think we might be mixing a discussion about identifiable data in which there are a number of requirements, including purpose of use and other things, and then I will call it simply de-identified data for now. I think the data that we are generally talking about in terms of liberation and in terms of, for the most part, community data use are data that is not necessarily identifiable. So the purpose of use doesn’t play any role.
I am just arguing that once the data is there, the purpose of use doesn’t play much of a role because the data can be and is expected to be used for anything and everything and as much as possible. I think we have to create a separation in terms of the framework for stewardship, because I do believe there is a stewardship for the data liberated which is intended and expected to not be identifiable. And there is a stewardship for data that is identifiable and has a number of restrictions.
DR. MAYS: I think the assumption is that we are dealing with de-identified data. The problem is the mash-ups that happen.
DR. FRANCIS: One problem is the mash-up, even if not individual re-identification, the inference.
The other is that at least for some, de-identification is not the only — purpose specification may matter even when you are using data that do not identify particular individuals within a group. The Havasupai Indian case where their data were used for ancestral tracing was de-identified data and was clearly, at least by that tribe, regarded as seriously problematic.
The issues are clearly different, but I did want to just put in a little more that de-identification may not be the end-all.
DR. STEAD: And you really made my point. If we try to manage this at the point when people link these together — because we have no idea how they will do that and we have no way of controlling that — we, in essence, have to come at this from the point of view that it is all identifiable. That means if we can capture a few things along the way, we can then manage those combinations downstream. That is my only point. I think it actually simplifies the problem to try to do it that way, not increasing the complexity of the problem.
DR. MAYS: I go for transparency and openness where we say this is what could happen.
MR. SAVAGE: In some of my work, one of the things we have put out there is the prohibition against re-identification of data, to not — instead of looking at how you build things together — it is not a guarantee. But there isn’t a prohibition out there in the first place, and it would be a significant deterrent and it actually involves a lot less effort than some of the things we are talking about right now.
I don’t know if there is actually room for this group to make a recommendation about that, but it’s something that would make a significant contribution.
DR. MAYS: What I think it might be useful, because I don’t know enough about it, is when we have our follow-up call, let’s put that on as an agenda item. I don’t know what it involves, and it may be something for us to talk about. Bruce.
DR. COHEN: I just want to reinforce that I think Mark has a really great suggestion. Even in publicly-available downloadable datasets, you have to get on the web to use them, and if people want to use them you can say — it’s not a formal data use agreement, it’s not a complex data use agreement I guess; it’s a very simple statement. I will use these data but I agree not to attempt to identify any individual or reuse the data.
When we have public use datasets, that is part of the agreement. I guess it becomes a research dataset because that is not truly public use because we’re requiring the user to say that. So there might be a distinction between fully public use datasets which require no agreement, and then another level which we call research data use which requires an agreement not to try to identify any individual in that dataset or reuse it for any purpose.
There are ways to get around it, but I really endorse it at the front end. It is not a guarantee but most people respect and use the data in good faith.
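The two levels described above, fully public use with no agreement and research use behind a simple statement, can be sketched as a download gate. This is an illustration only: the tier names, the `may_download` function, and the idea of storing the statement as a constant are assumptions, and the statement text is taken from the wording suggested in the discussion.

```python
from enum import Enum

class Tier(Enum):
    PUBLIC_USE = "public_use"      # no agreement required
    RESEARCH_USE = "research_use"  # simple statement must be accepted

# Wording taken from the discussion; not an official agreement.
AGREEMENT_TEXT = (
    "I will use these data but I agree not to attempt to identify "
    "any individual or reuse the data."
)

def may_download(tier, accepted_agreement=False):
    """Allow public-use downloads freely; gate research use on acceptance."""
    if tier is Tier.PUBLIC_USE:
        return True
    return bool(accepted_agreement)
```

As noted in the discussion, a gate like this is a deterrent that relies on good faith, not a guarantee.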
DR. ROSENTHAL: In most other verticals, the estimate is that the really sophisticated verticals understand about .0001 percent of how their data is being used. I might posit that healthcare isn’t as sophisticated as some other verticals. It’s the idea of being able to monitor what is going on and figure it out after the fact. Probably somewhat difficult.
That being said, I think we do need to distinguish what we are talking about. In terms of yes, we can make this assumption that it will all be used and daisy chained and that’s great, but that actually isn’t the stuff we are talking about. So the bulk of it is, and we should be really clear about that, are we talking about public use, are we talking about all-payer claims databases, are we talking about unrestricted use, which tends to be a decent amount of what AHDC and some other folks are interested in moving around, unrestricted use. Yes, you can mash it up and make it difficult. It’s really important to make that distinction so we don’t impute a bunch of historical conversation into this. Yes, there is liability.
So how do you go about doing that? I am not a theoretical type person. I think there are a couple really specific ways of doing that. One, just lay out what is allowed and what is not allowed meaning, hey, if you take another dataset that is under HIPAA and mash it up with an unrestricted set, guess what, bad things happen. If you mash it up and use it for model and imputation even on an unrestricted — some basic usage around that would be the easiest way to go about doing that. And you can do that through some folksonomy to say here are typical scenarios of how that happens.
That is probably about the best you are going to get. If you do want to move towards a monitoring system you are going to have to do a community-based monitoring system. Hey, tell on your neighbor on bad uses of data, and just put it up there. You are going to have to have people in the wild flagging. You are not going to use that in terms of catching a bunch of people; you are going to use that as an illustrative example of what not to do. That’s my two cents.
DR. SUAREZ: A very quick note because this is an important point. Most of you probably have heard about the 21st Century Cures, a very large legislative initiative currently under review in Congress. One of the provisions in that bill is a modification to the HIPAA definition of research data to make it part of the definition of operations under treatment, payment and operations. So it is modifying. The purpose, in good conscience, is to try to ease up the ability of researchers to access and use data for research purposes to advance treatment and to innovate and to make 21st Century Cures work.
I reviewed the legislation. It is open, it is available in Congress. When you read it, it will change dramatically how data that was supposed to be and used to be called research under HIPAA and had very strict guidelines on how to access, use and disclose, is going to now be handled as a new operation under treatment, payment and operations.
DR. ROSENTHAL: That’s a great example of the next level of work, based on changing context. Here is an educational piece; here is an example of what you can do with that and here is an example of what you should not do with it. Mashing up to unrestricted sets? Have at it. Here are three things you might want to think about. That would be my vote for how to think about addressing that.
DR. MAYS: I think that is great. Well, one of the nice things today is that you all have a hearing tomorrow, so we got to have more of you here. I really appreciated that, the extra people in the room in terms of what you could bring forth.
I want to say thank you to everybody for your time and for staying. Thank you to the Work Group members who did a lot of stuff in between for us to bring you as much as we have brought you. They have been really very busy, so I want to thank them. We are going to continue fleshing out based on what we heard here today, and then we will have some things to also bring back to the meeting in May. We will know how much and how fast by some of the changes we had talked about a little earlier.
Thank you, everybody. Travel safely. Your efforts are much appreciated, because I think we are coming out with some really good stuff that aligns with the rest of the Committee.
We are adjourned.
[Whereupon, the meeting adjourned at 4:40 p.m.]