[This Transcript is Unedited]
National Committee on Vital and Health Statistics
Work Group on Data Access and Use
September 29, 2016
Courtyard Marriott Hotel
Capitol Room
1325 2nd Street, NE
Washington, D.C.
P R O C E E D I N G S (2:32 p.m.)
Agenda Item: Welcome
DR. MAYS: Good afternoon. We are going to start the Data Access and Use Workgroup. My name is Vickie Mays and I’m from the University of California Los Angeles. I’m the chair and I have no conflicts. I don’t think I have to do that.
(Introductions)
DR. MAYS: Okay, let’s get started. You should have – I just want to make sure everybody has everything. You should have the agenda which we tried to do a pretty detailed agenda. You should have a copy of the matrix. So if online you don’t have anything just say so and I’ll make sure that we get it out to you but you should have a copy of the matrix which starts with Purpose: To provide a standard approach to categorize and summarize datasets. You should also have a copy of our workgroup plan. We need you to have a copy of the Guidance on Increasing Use and Access of HHS Health Data. So if you have all that then we’re ready to get started.
Let me do some introductory comments and then the only other thing I’m going to do, because I think Richard may be the only person that hasn’t had a chance to kind of introduce himself in detail to the group, so I’ll come back to Richard. But let me just set the stage for what we want to do today and also to have us kind of loop into what happened when we presented at the full committee. And then we will also, when we do Richard, we will also introduce and have Jim say a few words because we’re very lucky we have a lead staff now so we want to make sure that Richard and Jim get to say a few things. Let me do a little bit of an introduction and then turn it over to you guys and then I think we can get further started.
Thank you to everybody. As you know, we have some new people that have joined us. I think we have a great complement of skills to hopefully get us to our ultimate goal which is – the workgroup has been very invested in trying to come up with some solutions for HHS on how to make their data more accessible and for it to be used more. Damon has been pretty active with us in terms of bringing us at times what some of the issues are and in search of solutions.
Part of what I think we realize is that the kind of one-off solution probably isn’t the best consultation but instead maybe what we need to try and come up with is a bigger picture. A bigger picture of what the philosophy is in terms of trying to move ahead on access and use in terms of HHS data as well as to try and develop use cases so that individuals can see themselves in this guidance that we give.
I think that what we’re actually wanting to try and do is to also come up with a – and the other thing that we want to try and do is to be helpful in terms of the provision of some tools that we can begin to have a particular type of data owner look at and say, these are some examples of best practices. And when your contracts come up or when there are resources that they are able to begin to make these changes to those datasets so that not only are they in compliance with kind of best practices but that what happens is that they will also learn about ways in which to increase the use of your data.
So Damon, we appreciate your letting us know where that space can be in terms of the help that we want to be able to give.
We have I think a set of documents that we want to work on today. And the first is to really start with what is it that we want to accomplish. I think today we got some feedback about how wide our bandwidth should be. I think we got feedback today on some approaches to be able to address this. Because as you can see in the guidance document, there’s a lot of questions that we can go through and begin to detail what the focus needs to be of this guidance.
The other thing we have, which Josh is online, Helga is here, is we have the matrix and in terms of discussing the matrix, I think that we’re going to have to decide kind of how the matrix is to be used. I think we have different senses of how the matrix fits in, what the bigger picture is and kind of how to accomplish that.
So I think in having our discussion today we want to make sure that we’re going to be able to come to some sense within the group of the best way to use the tool, the best way, for example, to collect whatever information that we need, how we integrate into the other things that we’re doing on the guidance document. So I think we’ll make sure to do that. And, Helga, I’ll let you lead us in terms of that discussion so that there’s a good sense of your perspective of how to use it. And then we can talk about it as a group.
So that’s the big picture. So what I’d like to do is to give Richard an opportunity to say what his background is. We’ve done this before with Erica and Jillian. So Richard.
MR. LEADBEATER: Richard Leadbeater. I work for Esri. Esri is a provider of GIS software. I’ve worked with them coming up on 20 years. Prior to that, I worked at a public water utility. Health, I’m rather thin on the experience in health but I’m rather heavy on the experience with data. Within my tenure at Esri, I had the Census Bureau as an account support for six long years. I was assigned that for the pure reason because of my relationships with local government. The Census Bureau knew me because they saw me in the local government at the National Association of Counties, League of Cities and various other associations and they knew I got their story but they also knew that I knew the end user story of how the data get used and why, more importantly. So there’s a lot to contribute just from a pure agnostic data point of view.
So data has always been my passion. It is, in my opinion, the fuel for government and I’m constantly working with elected officials in how they need to understand their own data repositories and with the realization how little they know of their own data repositories and how to make that happen.
DR. MAYS: So now you understand why he’s here. Jim, can you say a little bit about your background. Jim is our lead staff and this will I think, really facilitate some of the linkages that we need within ASPE and other places. So Jim?
DR. SORACE: I am Jim Sorace and I’m a biologist by training. I have a master’s in information systems. I work as a medical officer in the science and data policy at ASPE. So nice to work with you all. I hope to have a lot of fun moving forward.
DR. MAYS: Okay, let’s turn to the agenda. The first discussion that I really thought would be good for us is to really do some goal setting and to really think about where it is that we want to land with this because this is a fairly large task and so I think the first thing to talk about is identify the type of HHS data. Today when I did the presentation to the full committee, one of the things I did was gave a whole list of different types of data that HHS has. And I think it might be useful to talk a little bit about the feedback we got today. So I’m going to task the people that were there to kind of share what you heard from the full committee. So that would be Walter and Helga and Jim.
DR. SUAREZ: This is Walter. Generally, I have to say, and this is a little bit of bias on my part too, but I think the framework, larger framework document that we discussed, included the topics was very well received. I think people and personally I think really see this as a very valuable instrument to move through and navigate through the various expectations and potential requirements for data systems.
I think a couple of the comments that people made during the committee were number one, through managing expectations really, that one was about making sure that we as a workgroup try to pay attention to the higher level guidance and framework level, rather than sort of navigating into the perhaps nitty-gritty details that could be easily an area where people can tend to do when it comes to looking at this type of framework.
By nitty-gritty I mean getting into the details of systems and all that. So avoiding getting to the details of managing the expectation of the output and the goal here. That was one of the comments.
The other comment I think was demonstrating the use of this framework by virtue of examining a few different use cases and there’s a contextual aspect of this too because as it was discussed, certainly the federal databases exist in many different ways and for many different purposes and audiences and there’s some data that is aggregated data that is available and accessible.
There’s data that has more granularity in terms of being data that contains information at the individual level even if – assuming let’s say that is de-identified and then there’s actually data that is identifiable and that in some cases could be accessible for research purposes, for example, or those kind of things. So there are different contextual elements about the data and then there’s different audiences and customers, if you will.
And so one of the suggestions was to look at scenarios that can test this type of framework with different examples of the scenarios, scenarios that have aggregated data that is fairly accessible, scenarios that have more granular data and even in scenarios that actually have data that identifies individuals that are exclusively and only accessible say, for example, to researchers after they have signed all sorts of agreements and contracts and things like that. So those were some of the highlights of I think what I heard.
DR. RIPPEN: Again, it was really positively received. So again, for those that really worked on it, it was – I think there was a lot of positive remarks. I think that from what I recall from what Jim had said is that not everybody who collects data has a lot of resources and so again, trying to make that balance as far as when is it worth the squeeze, when is it not worth the squeeze was an important one.
The other thing that actually Bill had said was as it relates to the matrix itself that he is going to give us the health data, what do you call them – so the domains and subdomains, so again going back to kind of this concept of data – what category is it, is it a food issue or a justice issue or whatever.
DR. SUAREZ: By the measurement framework.
DR. RIPPEN: No he said health data framework because everybody has too many frameworks and too many matrixes and too many datas.
And the only other thing was the terms that we use, right. Because when someone says data versus a database versus all these things, they mean different things to different people and so whatever I think we decide to use for what, we should be very clear about what it is and try to avoid getting people all excited about things that probably aren’t even to be excited about because we’re using different terms. So that’s what I recall.
DR. SORACE: I just had to drill down a bit more and say that just because the data exists doesn’t mean that we can ever make it public and in terms of managing expectations, we need to make sure that people understand that you just can’t get any Medicare claims data you want so that’s one thing.
I think two other things in sort of managing expectations, you don’t necessarily want to get involved in a lot of phone book and directory issues. It sounds trite but you want to be able to scope it so that you’re not keeping addresses of HHS facilities nationally as part of this effort. It’s important that we make it available but it’s not what we’re really doing. And then thirdly I’d say that NIH, for example, actually provides a lot of open data. In some ways it’s a good prototype and I think it’s maybe under-utilized in our thought processes in terms of how they make it available and how their experiences of indexing. But it’s clear that a lot of what we have on healthdata.gov is other but you want to make sure that people understand what the schisms are between making it available through NIH and then along with these other additional efforts you’re supposed to manage.
DR. RIPPEN: And then one other thing with regards to at least guidance with the matrix and that was try to populate it with different types of datasets, databases, to describe, to see whether or not whatever the summary is is useful.
DR. MAYS: I think we were able to channel Jim in the sense that some of the comments that he made in terms of getting us started. Other things that I think came up – oh, Bill’s here. Bill, anything that you wanted to comment that you heard? I’m trying to kind of bring everybody up to speed so that they have the advantage of having the full group comments and I think having each of us say what we heard is probably good.
DR. STEAD: I am sorry I got sidetracked. So having not heard the conversation, I don’t want to risk repeating what you all have already discussed.
DR. MAYS: Now, one thing that would help is what were you calling – is it the health data framework? Is that the name now?
DR. STEAD: Yes.
DR. MAYS: Okay, we’re good.
DR. STEAD: I would be glad to repeat that alignment, if that would be helpful to you.
DR. MAYS: Yes.
DR. STEAD: So, we’ve mentioned to you at several previous meetings what we’re now calling the health data framework to distinguish it from the measurement framework that we’ve been working on. And with Damon’s help we walked through that with the health data leads group. I’m excited about what the workgroup is trying to do with the matrix because the basic idea of tagging datasets with characteristics of those datasets is totally aligned with what we were talking about in the health data framework and what you’re doing, at least as I understand it, is going to specific use cases with specific groups of users and narrowing it down to the items that they think will be most helpful.
So I think that will give us a ground level way of testing that piece of this and what Vickie and Helga and I think are sort of aligned about, is Pophealth and the full committee will pass the baton on the health data framework to the Work Group and the Work Group can then edit and change as it sees fit, based on the feedback you’re getting, the items that are in the dataset characteristic component. You’re only going to be using a subset of those but you can do that, use the experience to refine them, get them in a good place. When people want to add something you could look at the broader framework and see, does it already have a source you can pull in but – so what this can do is provide a ground level refinement of pieces and you can use that to edit it and when you’ve gone as far as you want to go with this, pass the baton back to the full committee and we’ll see if we can get it to another place that can pick it up. So it’s sort of that way of moving forward. Damon, does that make sense from your perch?
MR. DAVIS: Yes, it does. I couldn’t help but think as we were sitting here and Rebecca and I were talking about this before we began, how you make the thing actionable and I think I just realized, one of the challenges that I think we have is timeliness for internal participants and therefore having a place where these items can reside and be referenced at a later date is going to be really valuable. So I want to think about that because we’ve got some policy documents that we’ve been sort of collecting and trying to make available to people and I realize that it could be good to have this in a place where people can refer to it in their ad hoc date and timeframe in order to sort of engage with this.
DR. STEAD: Right, because I would love, I mean if we could in the end, NCVHS is not an operating arm and I think we did a pretty good job of getting the conceptual frame of the health data framework out there. I would love for it, in essence, to go in sort of some open source form of evolution. I want it – it should be able to be collaboratively drawn on, used, refined without a heavy-duty process around it. If there’s a way to make that happen, I think it could grow and mature over – the parts of it that people find interesting and helpful could grow and mature. The other parts could sit on the shelf.
DR. MAYS: I think it’s part of what we want to try and do is kind of reach out to kind of a few different types of user groups, get some feedback, see what they say and then mature it enough that it would be usable.
MR. CROWEY: I was just going to say to that point, the specific dissemination strategy for the framework would probably be a work item to engage them. Who are the users to engage with? What’s the feedback mechanism? What’s the timeline? Are you measuring whether it’s connecting with the users and useful? I mean as a specific product of itself.
DR. MAYS: I think in terms of even putting it out there we want to also try and help people think about how to collect information kind of in the background that allows them to extend who it is that they are serving. So I think it’s probably a twofold thing. I see cards up.
DR. DORSEY: So there were just a couple of things. In terms of the matrix, were you thinking that the audience for this or the user for this would be both federal staff and the public? Because I think there certainly are ways that this work group could help to facilitate our data sharing in some regards, even within HHS. And then also for the matrix, is the vision that we would engage with some of the data owners to complete the matrix?
And so the other thing that I would say is thinking down the line in terms of dissemination, it’s going to be really important to think about a way to keep this a living document and keep it updated because we have created inventories of sorts in the past and it can be a challenge to update them. So if we think about them, I’m sure that we could come up with something but I also – putting all this work into developing this product, we don’t want to look up two years from now and then it’s just kind of sitting there and things are out of date and then it’s no longer useful. So thinking about how we can make this sustainable and a living document and something where we can have that engagement.
DR. MAYS: I wrote down the word sustainability while you were talking. But here’s – we are going to discuss what the vision is but I can just kind of give you a sense of from bits and pieces that keep surfacing that the idea would be to impact you internally in terms of helping data owners to be able to use this and benefit from it. We also want to make sure that we have use cases. So, for example, that was the only other thing I was going – one of the things I was going to throw in is that I think it was Bruce that said maybe one of our use cases really should be the researchers because I was thinking well they do the – but that’s the group that knows both sides of this very well and so they may actually be as early on a good use case to try and interact with, to let them really test this out because I think they are the ones that will tear it up and down and then give us feedback and get us a little further along.
The issue of sustainability, I want to make sure that as we go through this that we do it in a feasible way so that people will continue to keep it up. So I want us to pay attention to things like burden. I want us to pay attention to things like resources and costs. I want us to pay attention to can we build a feedback loop into this such that people will benefit. They’ll want to do this and they will benefit so that indeed they can show results. They can prove that because one of the things we’re getting concerned about is survey rates and things like that. And people have to prove that indeed people are using their data. I remember you had an example when we were on the phone. It’s like well, suppose there’s fewer people downloading your data then you say, ah, people don’t want this data but it could be fewer people download it because they right away know whether this is what they need or they don’t need. We have actually improved. So we have to also think about the metrics that will keep this alive. As I think it was Bill was saying, we’re not operational so we want to make sure that when we give it away and we give it away in a way in which it will be used and continued. So I agree with that. Helga, can I do one thing before I call on you. I forgot. Susan Queen is on the phone so Susan can you introduce yourself and say if you have any conflicts.
DR. QUEEN: Hi, I am here. I’m not on the committee so it wouldn’t matter I don’t think if I did have conflicts. I have been listening to the discussion since it began and thank you Rebecca for the number. It’s our new CDC phone system that would not allow me to be on the WebEx at all. So I’m from the National Center for Health Statistics and very interested in your discussion.
DR. MAYS: Thank you, Susan. I appreciate you being on. She has no conflicts. Helga.
DR. RIPPEN: So, as all of us are thinking about kind of the vision and scope and things that you were talking about Vickie, one could imagine that as it relates to the matrix, that just as we have MEDLINE where you have the ability to access all articles around medical findings, where you have a standard of a title, you have a standard of authors, you have a standard as far as key tags, you have a standard as far as how do you write your abstract. That if you have a similar model potentially for datasets that people “publish” or make available, that has a standard way of summarizing it, you can then actually start having something similar to a MEDLINE but it would be a data whatever you want to call it.
The interesting thing could also be that the question of sustainability, and again, this is let’s imagine, not necessarily what we should or shouldn’t do, the question of is it something like a National Library of Medicine should curate or is there another department or agency where the curation of that is available. And again, given kind of some of the expertise and if you think about leveraging kind of this data, health data framework, which really kind of then can grow, now you can actually with that and also methodology kind of characteristics, currency characteristics, one can actually quickly highlight those things that might be a gap for data.
The outreach to the communities as far as what’s available at any geographic level so you know if it exists or doesn’t exist, as opposed to trying to do the Googles of the world. And then the good thing is the Googles of the world, because Josh and I always laugh about it, had the ability to then also subsume it and push it. So again, these are the kinds of things, as far as potentially a way of thinking about just one tiny aspect, not the what’s the best practices for how do you present data, how do you summarize, it’s just really kind of how do you in the standard way think about it. But anyway, that’s just one thing to think about or potential, it just depends what this group wants to recommend.
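The standard dataset summary described in this turn (a title, authors, key tags and an abstract, plus geographic level) could be sketched as a small record type with a simple search over it. This is only a minimal illustration under stated assumptions: the class name, field names, the availability and cost fields, and the sample entries are all hypothetical and are not taken from the workgroup matrix or from any HHS system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetSummary:
    """Hypothetical standard summary for one dataset (illustrative fields only)."""
    title: str
    owners: List[str]                 # analogous to the authors of an article
    abstract: str                     # short structured description
    tags: List[str] = field(default_factory=list)
    geographic_level: str = "national"
    publicly_available: bool = True   # assumption: a simple yes/no access flag
    has_cost: bool = False            # assumption: whether access costs money


def find_datasets(catalog: List[DatasetSummary], tag: str,
                  public_only: bool = True) -> List[DatasetSummary]:
    """Return summaries carrying the tag; optionally hide restricted datasets."""
    wanted = tag.lower()
    return [
        d for d in catalog
        if wanted in (t.lower() for t in d.tags)
        and (d.publicly_available or not public_only)
    ]


# Made-up sample entries, for illustration only.
catalog = [
    DatasetSummary(
        title="Example county health survey (hypothetical)",
        owners=["Example Agency"],
        abstract="Aggregated survey estimates by county.",
        tags=["survey", "county"],
        geographic_level="county",
    ),
    DatasetSummary(
        title="Example claims file (hypothetical)",
        owners=["Example Agency"],
        abstract="Individual-level records; requires a signed agreement.",
        tags=["claims"],
        publicly_available=False,
        has_cost=True,
    ),
]

public_claims = find_datasets(catalog, "claims")
all_claims = find_datasets(catalog, "claims", public_only=False)
```

Under these assumptions, a search restricted to public data returns nothing for the "claims" tag, while the unrestricted search still surfaces the summary, so a user can see that the data exists and what access would require.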
DR. MAYS: Let me just follow that up and ask you a question because I know you had asked this week but let me just ask. So when you talk about it in terms of standard, you’re talking about a standard way – not standards, right?
DR. RIPPEN: Yes. I’m trying to be careful about language.
DR. MAYS: I was a little nervous.
DR. RIPPEN: Because there’s metadata and metatype and –
DR. MAYS: When you came back and said it wasn’t even about best practices, I realized that what you’re talking about is just a standard consistent way of doing this and then if we all could do this it might be useful. And there is always going to be – and I think this came up. I think Jim said this because you know, one size doesn’t fit all, there will always be some either types of data or groups that will want to do it differently but to the extent that there are similarities then that would work. Jim, Rashida and Damon.
DR. SORACE: I think that kind of NLM-ish thought is a very interesting one and I’m sort of curious about it on a number of fronts. So, for example, you may want to expose searching for your data but that doesn’t mean it’s available. And we have these datasets but you have to discover that it’s not available because people have to understand what’s available at a public level. And is that going to mean if you go to a search engine you get that right off the bat. So you can spend a lot of time reading on stuff that you’ll never get. So we have to think of how you might optimize the search strategy to meet those needs.
DR. RIPPEN: Actually, it’s one of the characteristics as far as is it open source, is it available, do you need to have special – is it going to cost you something to your point.
DR. SORACE: So those things get built in. You can expose data models and data dictionaries cautiously without necessarily getting involved in any PII, personally identifiable information, and I think that might actually be a very important thing to remember because just exposing your data model you’ll get more feedback on it, and you’ll at least get some input.
And the other thing that is of interest is I think in terms of who could take it over, and I’m not up on like HL7 or the various work groups there but there’s a lot of interest in just open data in general. I mean the worldwide web community, HL7 and so I was going to ask people who might be more familiar with those groups, is there a – where might this be going that we should be thinking in terms of where it goes, issues that could potentially be owners in two, three, four years.
DR. MAYS: We’ll wait because I think you’re going to answer this directly.
DR. SUAREZ: Oh, yes. This is Walter –
DR. MAYS: Okay, because your question is probably different?
DR. DORSEY: Mine is connected to what Helga was talking about.
DR. MAYS: Okay. So let’s answer this one and then we’ll come back to you.
DR. SUAREZ: Okay. So this is Walter. Yes. While open source and open technology capabilities and development are basically permeating across the entire spectrum of application and activities and there is not one – the good news is there is a lot of work. The challenge is there’s a lot of work being done by a lot of organizations. So as you point out, there’s the W3C which is the World Wide Web Consortium that is advancing a lot of the open architecture for messaging and other things. But in health care there is certainly HL7 through efforts like FHIR. So there’s a lot of technical aspects and a lot of development.
But indeed I think, I mean to your question, the direction is clearly more and more to utilize open architectures and open source standards and then also to not just utilize them but actually facilitate through those open source type centers the development of new technologies that will allow people to do things with the data differently than the way we are looking at it, particularly apps technology that are accessed through APIs datasets that then enhance or augment the value of the data.
That’s actually the biggest element is the raw data has some value as raw data but then it has a lot of the value, it really has augmented data by virtue of the applications and functionality and analysis then through tools that are outside of that data source. So I think that is a very critical aspect of it. It’s really making sure that as we, and as the industry itself but then federal databases and we as the entity recommending some of these, move forward really think about ensuring that this type of open source, open architecture and technical capabilities are built into them.
DR. MAYS: That’s good. I think we want to keep those in mind. Rashida.
DR. DORSEY: I just wanted to take a minute just to share an experience that we had. So, several years ago we actually did have a website called the Gateway to Data and Statistics and some of you may be familiar with it. But we actually had a catalog of HHS data resources, so it wasn’t just datasets but it could be reports that are published using data. It was a website. We actually had a librarian who tagged the different types of information. We had a contractor who would actually do like web crawls. And they would update the site and look for broken links.
Now I share that and also like sharing that, we decommissioned that website. Part of that was because of low usage. Some of the things that are in this proposed matrix, some of these aspects were in there. Some of them weren’t. And certainly the maintenance issue and within HHS in particular if you’re thinking about having a web source for something like this, you also want to think about connecting with an agency who actually has resources for this kind of work.
And so for policy offices where we don’t put a lot of – we don’t have a lot of resources to put into websites. It’s very, very bare-boned. And so we can make certain kinds of investments but there were upgrades that the contractor who was managing the site that they had that we just couldn’t do. And so the other challenge was it was hard for us to keep up and keep competitive with what’s going on with another in the web space for sites that can have analytics and other features that we just can’t do. And so then we tend to lose users that way.
We also don’t have resources to do a lot of promotion and so when there’s like a Datapalooza we can talk about it and then we might get a little bit of a bump. And then that’s it. So just keeping those things in mind when you think about the dissemination and how to keep it sustainable because it will be difficult to justify something that doesn’t have a lot of utility if we just are having really low, no users. So I just wanted to share that we had that.
DR. MAYS: But you did it at the back end, I mean somebody went out and did the crawl and did find stuff as opposed to the owner.
DR. DORSEY: And I have to look into – like I was the project officer for it and I took it over but it had been in existence for several years and we had – there was a consultant who was an SME who – so I don’t know the history of it so it could have been when it started that the agencies might have been involved and then it was more updating but even getting people or the agencies to stay engaged to do these kinds of things, you have to sell it to them too because it’s like something else.
DR. MAYS: Thank you for sharing that painful story. Because there’s one other painful story that I keep saying I want to hear about and that is what happened in terms of the data warehouse. That was another one that is slowly closing its doors as well and it’s a resource issue. So I think it would also be useful for us to at least hear kind of what happened and the space that those were in that for whatever reason it just didn’t work.
MR. DAVIS: Are you referring to the Health Indicators Warehouse?
DR. MAYS: Yes. So that’s another story that I’ve been very curious about. So that’s painful but thank you for sharing.
DR. DORSEY: We can learn from it.
DR. MAYS: Yes. Exactly. Damon.
MR. DAVIS: I wanted to circle back to a couple of things, some of which both Helga and Rashida have talked about. The first is we talked a little bit about metadata and one of the things that data.gov tried to stand up was common core metadata across datasets at a high level, how do we describe these things. And then many folks across the open data space have been advocating for common core metadata in a more focused topical domain area.
So if you’re in research, there’s probably several things that are common core to a research activity, the PI, the research, whatever funds the research, et cetera, et cetera. So it will be helpful to keep those kinds of common core tenets in mind as we try to advance this thing.
The other thing that – and I don’t remember why I wrote this down but for the Public Access to Research Data Policy. We talked a little bit about NLM and I think it must have been sparked when we were saying something along the lines of sort of common repositories for data or something like that. At any rate, they’re working to try to understand – for people’s recognition, the Public Access to Research Data basically says any agency that has a research budget of $100 million or more needs to have an access plan and make their data available. They are trying to figure out a lot of the components that coincide with both managing their intramural data collections as well as the extramural component.
I was just going to say the extramural can be as significant a challenge, if not a larger one, because you’re in a position of telling the University of (you name the state) that you have to retain this data for the duration of the project and then 10 years, or whatever the timeframe is, beyond that. So there are a lot of challenges with just general data management, let alone the more integral pieces of what we’re talking about here, getting it to fit into an advanced matrix of data management.
DR. RIPPEN: To me, that’s a low-hanging fruit because it’s one of these things where there are so many of us that are required, for example, to do that in the extramural, and it’s like, is there not a common place to put it? Well, ICPSR at the University of Michigan is actually becoming a business to do this, but it’s almost like we all have to do it. There is no guidance to do it so we all kind of do it our own way. And we try and do it in a way in which it maximizes our ability to use it and to have people interact, and there’s not a one-size-fits-all.
But also the other problem in terms of what you’re bringing up in terms of that policy is there’s no one place that communities can find out the studies that have been funded. So you can find out the NIH ones because of the NIH RePORTER but you don’t know the NSF ones, you don’t know the DoD ones. So now, I think, the NIH RePORTER is going to be used by a couple of the other agencies. So you are right that there are some things that are very simple that we could be able to do something –
PARTICIPANT: I was going to say, going back to the question of sustainability and what is it that we’re really talking about. There is the cost of data and stuff and where does it reside which is a separate question. I mean everybody has to maintain data and that’s always more expensive than you ever want it to be. And so whoever can do it does it. And then there’s the question of well, who is responsible for putting in data because anytime you have any form it’s work. That’s just the way it is.
So the big question is can one from a streamlined perspective when you describe a dataset publish it. And that’s really the time that you do it and that’s current for that dataset. Now the challenge again will always be if it gets decommissioned and you have links, that’s where you have the problem. But they have the crawlers that can tell you if it’s broken or not.
So I think as we think through the components as far as what’s the scope and kind of how does one make it sustainable and even how does it make it desirable. If you can publish, going back to use and the whole health.gov/data.gov thing, that you can actually drive traffic because it’s something that you want. You get credit for it because you publish a paper. Again, there can be incentives that don’t go to financial or having somebody after the fact do it that may make it more sustainable but that’s a broader challenge.
And it doesn’t answer the questions of the challenges of datasets that you have to keep for a certain amount of time that sometimes are super sensitive. But it’s kind of nice to know, if you’re a researcher and you’re getting a grant and you have a question, and you want to see what you can re-use. But then you’ve got a problem with IRB and different use and permissions. But anyway – but they are aligning them with the same standards that you’ve talked about with data.gov. And I know Joshua is very sensitive about that.
MR. DAVIS: One or two final things. Access level. We talked a little bit about trying to understand whether a dataset is really truly openly available or if there is some other hurdle, and the access level was part of common core metadata, where you are supposed to indicate whether data is completely openly available, has some level of restriction on it, or is not going to be on your Christmas list at all. So that can be really helpful in just immediately understanding whether something’s going to be useful.
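The access-level tagging described here can be sketched in a few lines of Python. This is only an illustration: it assumes the three access-level values used in the Project Open Data common core metadata schema (`public`, `restricted public`, `non-public`), and the dataset titles and the `is_openly_available` helper are hypothetical names introduced for the example.

```python
# Sketch only: access-level tagging in the spirit of common core metadata.
# The three values follow the Project Open Data schema; titles and the
# helper name are hypothetical.
ACCESS_LEVELS = ("public", "restricted public", "non-public")


def is_openly_available(record):
    """Return True only when the dataset is tagged as fully open."""
    level = record.get("accessLevel")
    if level not in ACCESS_LEVELS:
        raise ValueError(f"unknown accessLevel: {level!r}")
    return level == "public"


datasets = [
    {"title": "Example survey public-use file", "accessLevel": "public"},
    {"title": "Example restricted claims extract", "accessLevel": "restricted public"},
    {"title": "Example internal working file", "accessLevel": "non-public"},
]

# A user can filter a catalog down to what is immediately usable.
open_titles = [d["title"] for d in datasets if is_openly_available(d)]
print(open_titles)
```

Rejecting unknown values, rather than silently treating them as restricted, keeps the tag useful as a controlled vocabulary.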
I just wanted to touch quickly on the broken links thing. I was really resonating with what Rashida was saying about this catalog that a digital librarian was traversing the internet trying to find datasets for, and that job in and of itself is challenging, let alone if you find the dataset, you get it appropriately catalogued, it begins to be used and then the link gets broken. You’re now in a position of going back to your colleagues and trying to borrow their time to go find the proper link, come back to the platform and catalog the new link, and that is one of my greatest challenges on healthdata.gov. And we’ve tried to federate that out –
DR. DORSEY: And assuming it’s the same person, that you have that same colleague who is still there because when you factor in turnover then you have to start all over again trying to find the right person.
MR. DAVIS: One of the ways we’re managing that particular piece on healthdata.gov is to try to link to the catalog that the agency owns. So CDC has its own data.cdc.gov and healthdata.gov slurps in all of their datasets as a link library to that stuff and therefore the onus is on them to maintain their links there.
Our job is simply to have a reconciliation process that allows us to take links as they are updated from their supply and make the edits on our side. But it remains a challenge no matter what. The broken links thing is incredibly valuable and, as was said, I think as soon as somebody finds a broken link and they are not able to access the data, like they think the whole site is completely useless and you’ve lost them.
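The reconciliation process described here can be sketched as a comparison between the links held in the aggregating catalog and the links the agency currently publishes. This is a minimal sketch, not how healthdata.gov actually works: the dictionary shape (dataset identifier mapped to URL), the identifiers, the URLs, and the `reconcile_links` function are all assumptions made for illustration.

```python
def reconcile_links(local_catalog, agency_feed):
    """Compare stored links with the agency's current feed.

    Both arguments map a dataset identifier to its URL (hypothetical shape).
    Returns (updated, missing): links to change on our side, and identifiers
    the agency no longer lists, which need human follow-up.
    """
    updated = {}
    missing = []
    for dataset_id, old_url in local_catalog.items():
        new_url = agency_feed.get(dataset_id)
        if new_url is None:
            missing.append(dataset_id)
        elif new_url != old_url:
            updated[dataset_id] = new_url
    return updated, missing


# Hypothetical data standing in for our catalog and the agency's catalog.
local_catalog = {
    "ds-001": "https://agency.example/data/old-location.csv",
    "ds-002": "https://agency.example/data/stable.csv",
    "ds-003": "https://agency.example/data/retired.csv",
}
agency_feed = {
    "ds-001": "https://agency.example/data/new-location.csv",
    "ds-002": "https://agency.example/data/stable.csv",
}

updated, missing = reconcile_links(local_catalog, agency_feed)
print(updated)
print(missing)
```

The point of the design is that the agency remains the source of truth for its own links, and the aggregator only applies differences.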
DR. MAYS: Let me check and see in terms of individuals online whether or not there’s anyone that wants to comment because what I want to try and do is move us along on the agenda. Are there any comments?
DR. QUEEN: This is Susan, Vickie. I would just sort of echo agreement with both what Damon and what Rashida have been saying, particularly about the ability to have something that is sustainable, remains up to date and is fluid and then the challenges associated with resources and the ability to have consistent contributions, both in terms of budget and staff and time for something like this.
DR. ROSENTHAL: This is Josh. What I particularly liked about what Damon had said is addressing the challenges and seeing how they solved those with healthdata.gov. What’s the best practice? So if you can address access level, perhaps we extend the matrix so that the barrier to entry also includes cost, for example if it costs $40,000 for a license feed even though it is technically open. The practice that they used at healthdata.gov for broken links, linking to the directories instead of to individual links, could certainly extend as well, et cetera, et cetera.
MS. HUNTER: This is Mildred and I support the comments that were made concerning the sustainability of this effort because what is frustrating to my users which are community based organizations as well as some state health departments is that the data is available then when they try to go back and use it, it’s no longer available. I mean that mechanism. I think we need to make sure that we underscore the efforts to make sure that all of this is the same.
DR. BOONE: I just want to echo what Damon said. I actually refer a lot of people to healthdata.gov and many still struggle with finding the value in the datasets and I think a lot of it is attributed to the fact that there’s not a data dictionary made available. I don’t know if that’s – if you guys have done that recently but if you haven’t, it’s something to definitely consider.
I also think that having – and I think we talked about this with the use cases but clear demonstrations of value with the data would be helpful for people. I think they are still struggling on what they can truly do with the data that’s been made available on healthdata.gov.
DR. MAYS: Anyone else online?
DR. ROSENTHAL: Just real quick. This is Josh once again. In terms of implementation, it’s worth thinking about adding the contracting or standards embed, trolling on cycle and I think Jim had mentioned that exposing the data model or ERD and putting together a little bit of benefit. What does it benefit a data producer to expose that? Well, you get good feedback on your data model and it makes your work easier and your life better.
DR. RIPPEN: I guess I have a kind of value question. Given that there will always be challenges for maintaining datasets and for data.gov, is something like this valuable in moving forward or not? Because if we can provide – if it’s not then we can move on. If we think that leveraging kind of some of the core concepts that data.gov has done and expanding them in a more maybe broader way and adding some of the metatags based on the health data framework, and others use them consistently, is there value. Because we do have data exponentially being made available.
There’s a bigger push and so is there value in being able to have some kind of an indexing system to help people find things. Even if it’s a Google or even if it’s another app that sucks it in and goes through it and secondary providers. I think that’s the high level question because if there isn’t value then we have so much other stuff to do. I think that’s the key thing.
DR. MAYS: I think the issue of value added and making a business case is probably one of the things that we’re going to have to do. Richard.
MR. LEADBEATER: We kind of moved away from where I thought some of the worries were going but I did want to reiterate that with the state of the art of the technology today, federal agencies are accessing data storage in a much easier way and it’s going to get drastically easier in the future very quickly. To think of a portal – I don’t want anybody to think of a physical portal where the data is going to be sitting in a bucket. It’s a virtual portal. Every agency will be supporting their own data. Hopefully it’s operational data and not data put off to the side for storage and dissemination; it’s actually used data because that’s where it has real financial value.
To your point, curation is going to be having a curator, having a librarian is going to be the make or break. Somebody to maintain the lineage, the history, the versions of the framework and that history. That’s probably what makes and breaks a good active database.
DR. MAYS: I was going to change this but okay – Walter, and then Bill and then I am going to move us on.
DR. SUAREZ: So Helga said data.gov and then someone else was talking about healthdata.gov. So there is data.gov that has health as one of its elements, and it’s not healthdata.gov, it’s a totally different website. Again, there’s data.gov. One of its elements is health and when you click on health, it’s data.gov with health. Not connected to healthdata.gov.
PARTICIPANT: It is connected. Well, the website doesn’t show it that way but I think –
DR. SUAREZ: Anyway, the interest I really have in all this is more the capability of utilizing APIs to access the data. In some of the websites, for example, regulations.gov, there is on the homepage a specific clickable item for API developers, or app developers working through APIs. And so on the home page there’s an actual clickable element that allows quick access to the API components.
When you look at healthdata.gov and even data.gov, it’s a lot further down into the system and it is not necessarily clear how easy it truly is to interface through APIs with these databases. So I think there is that kind of an opportunity to really emphasize that type of an element in these databases because we’ve been saying, and I was saying before, API is not even the future anymore. It’s the today access. So I think there is something to be said about that, and exposing the data in a much more explicit way from the homepage and with easy access through APIs I think is very critical.
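To make the idea of programmatic access concrete, the following Python sketch shows what a developer would do once an API is easy to find: build a search request and read dataset titles out of the JSON that comes back. Everything specific here is an assumption for illustration, including the endpoint URL, the `q` and `rows` parameter names, and the response shape; a real catalog documents its own. No network call is made, and a canned response stands in for the server.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; a real catalog would document its own URL and parameters.
BASE_URL = "https://catalog.example.gov/api/search"


def build_search_url(query, rows=10):
    """Build the request URL for a dataset search (assumed parameter names)."""
    return f"{BASE_URL}?{urlencode({'q': query, 'rows': rows})}"


def dataset_titles(response_text):
    """Pull dataset titles out of a JSON response of the assumed shape."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload["results"]]


# Canned response standing in for what the hypothetical API would return.
sample_response = '{"results": [{"title": "Asthma prevalence by county"}]}'

url = build_search_url("asthma prevalence", rows=5)
print(url)
print(dataset_titles(sample_response))
```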
DR. STEAD: One of the things I think I’ve had trouble communicating from the early days of the health data framework that I think may also be getting into the conversation about the matrix is I don’t view this as about somebody manually curating something. And I frankly never really thought about it as something that would be required.
I think that if we could start by having a standard set of metadata that was easy to use to tag datasets, datasets not data elements, and if it was done in a way that made it easy to externalize or make explicit information about the dataset that the person who builds it knows, that’s very easy to apply. It is very difficult for a third party curator to in any way figure this out.
And so the idea was to have a very lightweight set of metadata, even the whole framework now is maybe four pages. I mean it’s four pages, small type with four columns but it’s not books. It’s not books. And so curating the metadata itself and tracking its versions, that’s pretty easy to do. It would have to be curated. The metadata would have to be curated. The use of the metadata with the datasets would not need to be curated.
And the other thing we’ve learned in other spaces is, once we construct this set of metadata, it in essence becomes what in the clinical informatics world is called a knowledge base, and in that world we’ve learned that it in essence is a concept map that vanilla NLP can then use to actually begin to infer the relationships out of, if you will, the summary block and some of the other things you’re thinking about when they are not done in a human way.
So I think a key here is to almost have a criterion for the effort: that it not require dedicated curation. In today’s technology world, anything that requires dedicated curation just doesn’t scale. It works fine for a small set of things but I don’t know –
MR. LEADBEATER: I was speaking at a higher level than that. I was thinking about the system, curation of the system maintaining the links.
DR. SORACE: I actually like those comments, Bill, because I was wondering to what extent you could just automate the curation of the databases and what software tools might be out there now to do that. And it might be worthwhile to run a few of those and have sort of a bake off and see what the – how they’d work on some of our databases, what the outputs might be. I mean that’s one thing you could have. It’s sort of a hackathon approach.
In terms of the value issue which came up a little earlier, I think we just have to understand that to actually use HHS datasets takes a lot of tacit knowledge because we do not yet have these knowledge management tools to direct you. And having not used them and then being hired by HHS like 10 years ago, people would have these conversations in the hallway and I didn’t know what they were saying. It just took a long time to learn it.
And there are really no adequate resources out there for the general populace, and I’m not even talking about the researchers, the biomedical research community, to understand what NHANES is and what’s the claims database. They just don’t use them. So I think there’s a lot of developmental work on that.
And finally, in terms of maintenance, we have to find somebody who has a budget and workforce that can take a small number of datasets, like maybe 100, and maintain them and finally we have to tell the agencies that they have to exert discipline when it comes to changing URLs. They have to actually handle that kind of thing. So a few nuts and bolts to put together.
DR. MAYS: So I think these are really good ideas that we’re capturing. And again, some of this is low-hanging fruit that I think – and some of this is at the kind of upper level. What I want to do at this point is – and this got discussed in the earlier meeting, is when it came up – remember there’s all kinds of data.
And so the question is, is there a particular type we’re going to start with? Well, I think instead the advice that we got was being able in some way to make a visual table and have different types of data, then to be able to talk about what applies to what kind of data. Again, Jim’s clear about this, we’re clear about this, there is no one-size-fits-all, but instead to try and figure out what applies to what kind of data so that we can actually have people be able to see themselves in the advice that we’re giving.
We talked about this use case a little bit in the earlier meeting in the sense that it came up of can we start with researchers. And I think I’d like to put on the table if there’s anything about other use cases to make our arguments for it. Researchers, I agree with the way it was presented. They are going to be the toughest because they probably have familiarity with some of this but at the same time we need to determine as we do this work and make kind of use cases about why to do this. What should the other use cases be? What other groups would we want to focus on?
DR. FRANCIS: This is Leslie. My sense has been that Josh’s comments I really want to hear in terms of private sector use –
DR. MAYS: Josh, they are calling on you to talk about who else you think would be great use cases.
DR. ROSENTHAL: Sure. Absolutely. So researchers, community, entrepreneurs, established market force players; a subset of that would be payers, a subset of that would be providers. And then back over, if you wanted to define consumer as kind of a direct-to-consumer thing, because people think that’s important. And then finally students. And a subset of that may be non-health care students with pure technology as well.
DR. MAYS: I think our start will be in terms of thinking about the researchers. We also want to think about, if we’re going to think about consumers, I just want to hear – because I think for a start that’s a really difficult one but I want to hear about how to engage them. Mark, this is something that in your life you are very active in doing. Can you talk a little bit about that so that as we consider these use cases we have a sense of – because right away most people don’t want to do the consumer because it’s really hard and it’s too individual. But this is work that you do so I’d like to at least have us think about it.
MR. DAVIS: Mark, can you frame who consumers are for you too, if you don’t mind please?
MR. SAVAGE: My opening comment would be to say there is no one homogenous consumer. I think we tend to talk about it that way and we expect that but the National Partnership for Women and Families did a survey, a national survey, in 2014 and we found people all over the place. We weren’t surprised so we tried to measure it. We tried to oversample in some communities that tend to be under-represented to make sure that we had accurate results for them.
So we looked at things by race. We looked at things by sexual orientation, gender identity, age, education, income, geographic location, and there are differences. Even, for instance, we looked at mobile access and found that depending on looking at things by race and ethnicity, non-Hispanic whites were the least likely to use mobile access to their health information.
So what we took from this was that you’ve got to design and build for that diversity, anticipate it. Build it in at the beginning so you’re not building barriers into your database or building barriers into your technology. Now that’s from the sort of individual perspective – oh, and I should say, we noticed differences by phase of life. So one of the things we found is that some people wanted some things at some points, like if you’re healthy, you might be looking for convenience features.
But if you’re not healthy, if you have a chronic condition, you might be starting to tap into databases about chronic conditions that healthy people might not. If you’re healthy but you are a pregnant mom, you might be tapping into information that the healthy person who is not pregnant isn’t tapping into. So just again, thinking about different phases of life and what are the interests.
So that’s at the individual level but I think we should be thinking about consumer engagement at the community, at the population level as well. So we have done a lot of work on health equity, health disparities, starting to aggregate data. We’ve built upon the individual level of patient generated health data helping patients to be able to contribute data to the medical record. Another way of looking at that is that okay, so now the doctor actually is getting access too to important data, like we were trying to get consumers to have access.
But that rolls out more broadly to social determinants of health. So patient-generated health data at an individual level is now also being seen as the vehicle for bringing in social determinant data into electronic health records and in order to understand the 85 to 90 percent of factors that are accounting for health status that are not in the clinical setting.
My team supports Academy Health on a grant from ONC where we are helping 10 different communities across the nation address a community defined population health challenge. They are trying to exchange data across multiple sectors. So it’s designed to show interoperability, solving problems, addressing health equity.
And they are doing different things. Some is care planning. Some is about combining, looking at asthma across communities. Some is about looking at how you can provide better treatment for people who are incarcerated. When you start thinking about consumer engagement as a community issue then what can be useful, what people need, what they are going to be searching for expands multifold. And I think the Academy Health example is a good place to look to see what some of those needs might be.
Ramping up even further you can think of this as sort of community created big data. And I would just throw in one of the issues that’s going to be important from a consumer-engagement perspective, it’s a talk that I gave at a conference led by the Leadership Conference on Civil and Human Rights back in 2014 that there’s a lot of concern about how big data can be used for profiling so some of the datasets that we would have at HHS people may worry about that. At the same time, big data is what helps you identify and reduce health disparities. So it’s not big data itself, it’s how it’s being used. Again, something to think about a big important topic around consumer engagement.
So what I take from all of this is we’ve been having a conversation about how will different consumers use HHS data. How do we help them use that? But I want to also posit that the communities are creating data that HHS is going to want to access and incorporate into the data that in turn comes back and becomes useful.
And perhaps a good example of that is the Precision Medicine Initiative where the White House is turning to a cohort of a million people to gather data to create a research database to in turn provide precision medicine. In other words, we are turning to the community to create the database that will be useful to the White House, to HHS, to researchers and so forth. So I think of consumer engagement as a – maybe as a circle or a cycle as well. It’s not a one-way street. Just some thoughts.
DR. ROSENTHAL: This is Josh, just real quick. When Leslie had asked me that question, who are the users, I rattled off a litany. If we’re thinking about it strategically in approach, what I would urge us to consider is who are the users today, and we looked at that last time, that it’s tens of millions of people using HHS data. Most of those consumers are not using it directly. They are using it through secondary and tertiary people. And so, literally, should that even be a priority for HHS?
If you believe in the whole Todd Park Open Data, Health Datapalooza, of wanting entrepreneurs to create that, to fill in the gap, have a thousand flowers bloom because HHS can’t do it, and those are the means by which people are accessing HHS data today – that’s a magnifying effect that’s probably worth thinking about if there is that sort of disparity in terms of who is actually using it.
So I would squarely look at entrepreneurs on that side and at least ask the question, like where do you get the most bang for your buck and the most usage and those kind of secondary and tertiary magnifying factors.
DR. RIPPEN: I just want to build off of it. If you think about if it’s associated with metatags or metadata that goes with the data, it can also then provide an opportunity for the government to actually be able to understand how their data is being used. It goes beyond. Because sometimes you know you don’t get the credit if one user is hitting the database but that one user may have a distribution channel of a million people. So again, going back to how does one then tie it back to the source too.
DR. MAYS: Let me just make sure that the other couple of people who were going to comment on this get a chance and then I’ll open it back up for questions. Mo, do you have any comments about the consumer engagement issue? You might be on mute. Can you tell me if Mo is there? Okay. Mildred, are you still there?
MS. HUNTER: Yes. Let me begin – and I’ll preface my remarks by indicating that the consumers that I’m referring to are primarily consumers at the community-based level, both community-based organizations and those entities who are looking at data to address health disparities and health equity, to use the data for policy issues and then also to access the grant opportunities that are available from the public as well as private sources. Let me also mention that we not forget the tribal population because the tribal population are – I mean the American Indian populations are also ones who are concerned about the absence of data.
I’m going to give you just maybe one example or maybe two. A work group that I have is the Region Five Minority Health Interstate and Tribal Data Quality Work Group.
DR. MAYS: Mildred, can I make sure that when you give me the example that you give me a sense of how to engage them, the solution.
MS. HUNTER: And that work group is primarily to address the data gap, the health status data gap, for all racial and ethnic minority populations where the data does not exist. And so we have been addressing the issues of data and those that have been discussed on this call as well as the meeting that was held earlier this week.
There are several – and to just respond to your question, Vickie, this is a venue whereby – and let me just go back and tell you the participants on this work group represent state health departments who generate data for the minority health program in their state. And they have convened a work group and information sessions in their state as well as the state offices, the minority health in this region that exists across the region as well as local offices of minority health in Ohio.
And these entities have held information sessions on data as well as provided town hall meetings so that they can get feedback from consumers at the very local and the granular level at the community in terms of data use, data access and data needs. And so I would support that and if you need a venue or a mechanism to engage and to get community feedback that certainly in Region Five we have an active group that can do that as well as my counterparts across the country that exist in all of the regional offices. So I’m going to stop here and entertain any questions.
DR. MAYS: I think the comment about consumer engagement at the population level probably if that were to be a use case is probably where we would go. But I think we’re going to start probably with the researchers. So let me take one more comment –
MS. HUNTER: Let me also add that in some of the community-based organizations there are also the researchers that are involved.
DR. MAYS: Okay, great. Thank you. I think what I want to do is be mindful of the time and use the time that we have left to deal with the two big pieces that we have. One is the guidance document and the other is the matrix.
So part of what we want to do is try and think about how these things work. And Erika, this is where in terms of thinking about the guidance document and then Helga in terms of thinking about the data matrix. So I want to tell you my thinking and then it would be helpful to hear the two of you and then we’ll want to entertain comments about how we move ahead.
So the vision that I had is that the guidance document really helps us lay out principles. It would be in line with what was talked about earlier today. We would have some sense of different kinds of data and that some of these best practices, suggestions, commentary, et cetera, will vary depending on what kind of data it is. And in the guidance document, as you can see, if we turn to that, there’s a series of questions that we have posed because it’s like how broad are we going to get, who are we going to talk to, how will we get us down to a use case is really important because we only have a certain bandwidth.
We don’t want to live through this for years to come. This is a field in which we also want to turn this over and collaborate with others. So if you look at the guidance document, for example, what are some of the potential topics? Okay, a conceptual framework for increasing our use and access to health data. I think that – and I’ll let Helga speak about it but she sees the matrix as the conceptual frame and I think at the end of the document you will see that there are some conceptual frames that we’re actually proposing that are kind of the overarching principles.
So look on page 10 of the guidance document, potential conceptual framework to adapt for the report and characteristics of data use, data quality and usability, health impacts, and again, it fits with some of the things that have been talked about in the matrix. Some are there in the matrix and some aren’t. And I think that this comes from the New York project where they moved the public health data into an open data format.
And then there’s also our typical frame of the health statistics vision. So there’s been kind of some thought about how to frame this so that we have not only this is what you do but some principles where people have to make decisions, some organization in terms of when people think about it is that needs to be done. So if we go through the guidance document it’s like there’s a conceptual frame, the issue of characteristics of the data and data users, how do you create high quality and usable data, the issue is there in terms of metadata. So this would give you ideas whereas my sense is that the matrix is there to show you the specific things you should be considering. But for some people they have to be convinced to do this.
For some people I think they have to have an idea about what the principles and goals are. We need to make sure for people that we’re covering things like privacy, security, confidentiality, some of the access issues, how to help them think about again engaging different consumers.
And the issue of implementation and sustainability. For those of you who are on the call, the last is probably one of the most critical and that is how do we change the culture. And I think Rashida brought this up in terms of some of the examples of we have to think about the resources, we have to think about the feasibility, we have to – and I think we need to provide people with some talking points on how to do those things.
So for me, it was that these are the things – and we don’t do them all – but these are the things that we actually kind of write about and offer some guidance in terms of it. The matrix is utilized in the context of examples. That’s where our health data framework is useful, for people to be able to look at and see kind of what are some of the domains that people are thinking about and like how would I do that in terms of my datasets. So that was the thinking that I had for how these two things work together. Let me give Helga and Erika an opportunity to talk.
DR. RIPPEN: So again, I think that the overarching kind of guidance framework that you’re talking about is really kind of best practices. So if you are going to create a dataset, how do you do it, what are the things you have to consider and then how you make sure that it’s meeting the mark from what I understand.
The matrix is more from the perspective, and again, it depends if we want to change that, it’s been from the perspective if I’m going to post data, so I have data already. I can’t change anything about it. I don’t have the resources. I have data and I want to make it available and I want to kind of characterize it in a consistent way. So basically that’s why the question of who’s the data owner, so if there’s a problem you know who to contact, you know who produced it because it’s not always the same, again kind of talking about the attributes and kind of the historical perspectives of the data because you can’t have someone curating it because it’s too expensive. Bottom line, it’s whoever creates the data and they want to share it, they are going to the effort of posting it, hopefully.
So then there’s the theme – and the theme really – and actually, working with Josh, we put it in an Access database and it’s a one-pager with drop-down menus so you don’t have to type in stuff because any kind of extra work decreases exponentially the likelihood of it being filled out. So from the theme perspective, those are dropdowns, just click whichever ones you think fit.
Then what’s the purpose because if I have a purpose as an agency for certain types of data that’s why I collected it. And everyone needs to know that that’s the bias of the collection of the data and so here we have research, evaluation, planning, payment and going back to making sure that we’re using consistent forms.
The audience, if it was for me to set policy, I’m going to say it’s for policymakers but that doesn’t mean no one else can use it. It’s just the context of why it was actually collected. The geographic scope, so now I have a tag that says well, how granular is it or is it not. The funding, that’s important because if this is an unfunded mandate and it happens to be one shot, it’s good to know because I’m not going to as a community decide to take a lot of effort to continue so I’m not going to be surprised that next year it’s not going to be there. So again information that provides that insight.
Then what level, because it would be federal because this is what we’re talking about here but it could be state, it could be a local group, it could be a nonprofit. And then how critical was it to that organization. That’s so that if it’s critical you know it’s going to more likely be around for a while or you can make a judgement.
A data source methodology becomes very important because you know how willing are you going to be able to trust it and again, it’s not from the consumer perspective if it’s someone that may not know the difference between a direct survey or interview survey or email survey or that it’s an output of an EHR or an output of a billing process. So really high level sort of things with the ability to provide detail if you want to and if you have it, obviously the more the better. But again, trying to balance it.
Data currency, so what dates are we talking about is an important one. The latest date that you have, the publication date because sometimes it takes a year, two years to publish from when it was actually what we’re talking about the timeframe. How often you talk about refresh. Lifetime expected end date. And again we can cut these, add these depending on what people think is important.
And then the data availability, some of the things that actually should go over some of the terms that you use on the data.gov component. We had open source requests. It’s free but you have to ask for it or you’re going to have to pay for it or is it de-identified, at what level. How can it being a paying because if it’s not available and you have to ask, how do you do it. And then data quality. Data quality is two different things. This is the perceived data quality of the people posting it. It’s not an evaluation. And then how did you validate the data because again, it’s perception.
Then the data details, it’s just really kind of what’s in an observation, how many data elements, file format, data dictionary, do you have one, do you want to link to it. Is it machine readable. Number of entities, file attributes, geocoded, again this is a little redundant.
And then also does it support API. Going back to those kinds of things and then if you want to add things. Now, depending on how you implement it, then you can always add additional things depending on the philosophical approach. So if you have kind of this library, you can have users and go back and talk about well, was it useful or not, how did you use it and all kinds of things.
But again, it goes back to what’s the vision of how this may or may not be useful to people who want to consume data. Because this is by the person that’s saying hey, here’s my data, come and get it and here’s a summary so that you can quickly assess whether or not you want to do more effort.
And Josh, I know that you had then, when we did the access database, you kind of kicked it around and asked some of your team to do it. Do you want to comment?
DR. BOONE: Just ever so briefly, we ran it through everyone from MPH and MD students to PhDs and computer science looking at kind of the top various different data sources, probably about 40 of them. We elected not to put them in and share them today just so we didn’t get kind of caught going down rabbit trails of particularities. But on effort it took about 10 to 15 minutes people putting in the top basic sources.
And if asked for the difference between the overarching guidance documents and the matrix, the matrix answers the question of what do I need to include with the data release to ensure that it creates the most usage and grants the most access to the most people. We looked at HHS sources and said why do some of them have 50 users and some of them have five million users and these are the things that help in a push model as well as a pull model which gives you much greater impact in terms of usage.
DR. MAYS: And then Josh, I didn’t know if you wanted to add anything else if I missed anything.
DR. ROSENTHAL: That is great. That’s fantastic. We need a concerted effort not to include a bunch of things, trying to keep it as simple as possible. And going through those data sources – not every source fit all of these variables as very open in terms of kind of being a flexible framework or matrix. So some sources fit here, some sources fit here. And it answers a core thing.
If I’m looking at this, if a source cost me $40,000, even if it’s unrestricted access, it’s not so helpful to me. And these are the things that get picked up by autoscrapers. This is what gets picked up in distribution systems and it doesn’t need to be manually managed or curated. So do this and all of a sudden a lot of people start using your stuff and building on it and putting it in to their own distribution.
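(Illustration: The matrix fields described above could be captured as one small machine-readable record per dataset. The Python sketch below is only one possible rendering. The class name, field names, and the list of levels are assumptions made for this example and are not taken from the workgroup's Access database. Fields left blank are reported rather than rejected, in keeping with the point that not every source fits every variable.)

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional
import json

# Illustrative drop-down choices (assumed for this sketch).
LEVELS = ("federal", "state", "local", "nonprofit")


@dataclass
class MatrixEntry:
    """One summary record a data owner fills out when posting a dataset."""
    title: str
    data_owner: str                      # who to contact if there is a problem
    data_producer: str                   # may differ from the owner
    themes: List[str] = field(default_factory=list)
    purpose: Optional[str] = None        # why the data were collected
    audience: Optional[str] = None       # intended audience, not the only one
    geographic_scope: Optional[str] = None
    funding: Optional[str] = None        # e.g. one-time or ongoing
    level: Optional[str] = None          # one of LEVELS
    source_methodology: Optional[str] = None
    latest_data_date: Optional[str] = None
    refresh_frequency: Optional[str] = None
    access: Optional[str] = None         # e.g. open, on request, paid
    perceived_quality: Optional[str] = None  # the poster's own view
    file_format: Optional[str] = None
    data_dictionary_url: Optional[str] = None
    machine_readable: Optional[bool] = None
    supports_api: Optional[bool] = None

    def __post_init__(self) -> None:
        # Drop-down style check: reject values outside the allowed list.
        if self.level is not None and self.level not in LEVELS:
            raise ValueError(f"level must be one of {LEVELS}")

    def unanswered(self) -> List[str]:
        """Names of fields the data owner left blank."""
        return [name for name, value in asdict(self).items()
                if value is None or value == []]

    def to_json(self) -> str:
        """Serialize the record so catalogs and scrapers can read it."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    entry = MatrixEntry(
        title="Example survey",
        data_owner="Example agency",
        data_producer="Example contractor",
        level="federal",
        machine_readable=True,
    )
    print(entry.unanswered())
    print(entry.to_json())
```

(Keeping the record flat and optional is a design choice for this sketch. It mirrors the one-page, drop-down form described in the discussion and lets a partially completed entry still be published.)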
DR. MAYS: I think for the data owners that this is a very organized, very streamlined approach for them to be able to input the information and get it out there for people to be able to access it. Erika, can you talk about in terms of the guidance, some of the questions that have been posed and kind of how that allowed New York to be able to move from where it was to actually making its data more available by having some of these types of information for them.
MS. MARTIN: One thing that I see between the alignment with the guidance document and the matrix is I see the matrix as also more of a metadata quality. I also guess I should say I think it goes back to past comments that we had about how we should be clear on terms. So for me I also thought that the matrix is about ensuring your data gets posted in a way that you engage users and you transmit information about the data.
So what’s happened in New York is they have something similar to what Josh and Helga talked about the matrix. So they have a metadata form and every single data owner needs to fill this thing out as part of releasing open data, it’s just part of their business process for releasing it. And the process of doing this has actually been a few things. So one is that it has standardized all of the metadata so now as a user you can go to the open data portal and you can really easily sort across datasets and decide as a user which one makes sense for you.
But they also have some kind of readme files that they have with suggestions as how could people use the data and they think that that’s actually allowed the data owners to reflect a little bit on what might be some creative ways the community could use the data and that’s actually helped them generate some ideas for codathons and other community engagement activities.
So I think that New York has sort of found it helpful to have something like this matrix document and we can certainly ask them for input on what they might want to update or change with the form.
DR. MAYS: I think in terms of the matrix they designed they actually used your crosswalk and it’s more in terms of the side of thinking about giving this information that comes in the guidance. Can you talk a little bit about that?
MS. MARTIN: So, the guidance we can think about the analogy in New York is that they had an open data handbook that came up with similar kinds of things. So I don’t know that they necessarily started with a conceptual framework but the idea of the handbook is that it laid out these sort of principles and ideas and things that they wanted their data to become and so that ended up both getting into the metadata but also –
DR. MAYS: Erika, are you on a speaker phone?
MS. MARTIN: I’ve been hearing the same feedback with others. So, my understanding from the New York story is that by putting together something – in their case it was the open data handbook. I think in our case we’re calling it the Framework Document, it’s helped with the vision and it’s also helped them mobilize a little bit.
DR. MAYS: So, I think what we want to do is to kind of have a discussion of what the product is or the products because it isn’t just one thing. Ultimately, we want to have something for data owners and I think that the matrix seems in both sides that that is useful. It helps to standardize. I think it becomes something that is – I don’t want to say quick but it’s something that should not be burdensome. So I think in terms of how that’s setup that that sounds great.
I think that we want to think about what advice that we want to give the Secretary in terms of making data more accessible and usable. And that seems to be some of the issues that are raised here. I think that as Susan Queen is well aware, we’ve raised some issues about privacy and confidentiality in terms of where some of the data that’s there. I think that depending upon what type of data we’re talking about, there may be some recommendations that we want to get to the Secretary so we have to figure out the process to be able to have that.
I think in the past what we’ve talked about and Kenyon has been kind of front and center about that is the ability to actually write about these issues so that we get industry more involved, that we have things that we put in industry newsletters. We have things that we put in academic forums so that the issue of some of the best practices and why to do some of these things are also discussed. So that was on the table. Kenyon, am I missing any other things that we had talked about as possible ways for this group to consider?
MR. CROWEY: I think the other thing I would mention as part of sort of fostering the dissemination and communication through industry is the social communities, the communities that are using this data, the communities that are sharing insights, the communities that in some instances post their results, post the methods they used for the result. With many of these datasets there’s a learning curve in how to optimize the use of them and the tools that are using them, whether it’s R or Python or whatever.
But to have ways that the community is sort of fostering their insights, what they’ve learned. How they’re using it? What’s worked? What’s not? What variables may or may not be available? The actual hygiene of particular subsets of the data for example. So as part of all this, the whole sort of social community strategy and where we’ve talked before about the metadata and being able to understand what users of certain data have other data they are looking at and how you might make that visible and accessible to individuals in terms of this sort of findability part of the data I think is important.
DR. MAYS: We talked about that as a potential blog, whether or not we could get people to blog on that as a way to kind of have it in real time and available. And Josh actually designed a prototype of the blog. But that has to be kind of outside of us. We have to find someone who will kind of take that over. It would be great if we could get AcademyHealth or somebody to be the site for that. So yes, we talked about that.
MR. CROWEY: I think this is great work to encapsulate a lot from Helga and Josh and Erika and a lot of others have put into this. I think it’s comprehensive enough at this point where you just want to get it to users sooner rather than later, start getting some sort of real world, experiential feedback, running some cases through it. And then start tweaking.
DR. RIPPEN: We actually are looking at just from a test case perspective to your point is if we could provide a way for people to enter it within a common system, just the summary stuff. Just so that we could then take a look to see if people had comments, put comments in it to get feedback in real time. What do you think about that?
MR. CROWEY: I think that’d be great. If there’s some sort of lightweight way to survey online form –
DR. RIPPEN: No survey.
MR. CROWEY: Not survey, data collection tool. Checklist. Feedback. Something easy lightweight.
DR. RIPPEN: I agree.
DR. MAYS: And to have it done systematically enough that when we give feedback to the Secretary we’re real clear about what we did, who we did it with, and how broad it was in terms of then turning around and asking people to make significant changes. Let me hear from others, what do you think about – oh, Richard has his thing up.
MR. LEADBEATER: You were talking about case studies but in talking to the Secretary I think it would behoove us to have an additional column saying what federal program the data supports. I’m looking at data from coroners and I’m thinking the opiate programs that the White House is putting out. That justifies that dataset. Planning and engineering data, streets and sidewalks, Department of Transportation has mandates for disability access to sidewalks at intersections. That data justifies, ties right into that program. You could go down every one of these datasets in this document and find the federal mandate or clinical mandate that it solves. But looking at him as a customer that’s in need of this.
DR. MAYS: Other comments, thoughts about where we’re going?
MR. SAVAGE: Building on that last comment. When I did some work with one of the HIT policy committees where we were trying to rank some use cases and we were trying to identify priorities, one of the things we did was to rank them against the national quality strategies, those six domains. And it might – just popped into my head, it might increase usage if people saw that a particular dataset fed into a major national priority and that this was the way it was relevant to that major national priority.
DR. MAYS: That’s really interesting. Leslie, you were going to comment?
DR. FRANCIS: I was just going to say that the more we’re focused the better. All of this has taken a tremendously long time and it’s nice to actually see – I think because this is about data use –
DR. MAYS: You’re coming in and out.
DR. FRANCIS: What I was going to say is since this is about data use, what I would really emphasize is what data uses do you think are the most important in deciding how to prioritize.
DR. MAYS: I just want to make sure because you were coming in and out whether you’re still talking or that was –
DR. FRANCIS: No, I put myself back on mute.
DR. MAYS: Oh okay. Other thoughts? I guess I want to comment again about the kind of what Mark was saying and I think you’re bringing it up as well in terms of strategically it seems that having people see what this is used for and how it can even go beyond – so the data owners can see it but we also I think having the data users see it. If there’s some way, for instance, if a community saw ah, there’s disability data in here –
MR. LEADBEATER: I think if you identify digital or data-driven justice, both audiences immediately see why one day they have to start collecting it, start collecting it better and then the user sees oh, this is why I use it. So I think finding that program on why the data needs to happen, why it even gets collected or reported, that could be done in the one column of what program could it support.
DR. MAYS: How feasible is that, I want to ask like Rasheda, Jim? Do you know what – or Damon – do you know what your data is used for and what it is that it can support that lines up with priority initiatives?
DR. DORSEY: I’m not sure we have – there’s not like that crosswalk that I can think of that’s coming to mind when there’s this comprehensive listing that describes our data in that way. I would say that the people who work in the programmatic and policy office that’s focused and specializes in a particular area, they have a good understanding of what those data are and also data that are connected to like Healthy People, like the data that support the Healthy People indicators. When you look at our strategic plan, some of the data that are used to support some of our agency goals, I think those are a little more widely known but because of the wide scope of the different areas and topical areas that we cover in HHS, I don’t think that that actually exists.
We’re doing some work in the data council, through the data council looking at some of our administrative data and I believe Jim provided an update on the Commission on Evidence-based Policymaking and their interest in maximizing the use of administrative data for research and evaluation purposes but also for the purposes that are outside of that intended use and how it was collected.
And so that’s also newer for us in regards to definitely our administrative data because of the barriers to sharing, some of them are legal, privacy and so on. But I would say that the interested stakeholders know. So the people who work in the area of tobacco, like they know what tobacco data are available but they might not know what’s available if you’re interested in looking at like alcoholism or what have you.
MS. HINES: I would say that one way you can explore that question is to look at mission. So, for instance, there’s the HIV surveillance systems and various surveillance systems that you look at the mission of the OpDiv, in this case CDC. You look at NCHS, they have surveys. That’s part of their mission. They are supposed to provide nationally representative estimates. You look at HRSA that’s got grantees. You look at NIH, it’s got research.
I worked extensively with HRSA grantee data. All of that helps the agency form policy around how to run the grant programs. The grantees provide some kind of periodic reporting back to the agency, that’s administrative data but it’s very specific to the running of the program and the program management. So AHRQ collects its own various forms of data and so forth. FDA, which I’m most familiar with. Indian Health Service. If you go by mission, you can almost sort of come out from the federal perspective of what the purpose of each category of data are based on what the mission and the legislative mandate to run each of the programs are. That’s one way you can look at it.
DR. QUEEN: Can I jump in for a second. I was just going to say it’s very interesting to me having moved from ASPE to NCHS working on issues related to budget and transition materials, I am seeking where NCHS data has been used. This is not by the public necessarily but practical uses for the development of a recommendation or guidelines or aside from the prevalence estimate that we produced and I’ve been coming up with concrete examples for how the data are used for public health. And it’s been a time intensive exercise but one that I think that’s highly valuable to demonstrate the practical applications of the data.
MR. DAVIS: How are you going about it, Susan? This is Damon.
DR. QUEEN: Oh, Damon, you wouldn’t – some things are obvious because NHANES data are used for dietary recommendations and guidelines, blood pressure for children. NCHS data were widely used in the 2016 Opioid Prescribing Guidelines that CDC put out but if you read those guidelines you would have to know that NCHS data were being used because NCHS isn’t specifically referenced. And it’s a challenge, but part of it is for budget in a time of declining resources trying to point out the value of the data. So it’s somewhat intensive but I think it’s going to be very worthwhile. I’m finding things that I didn’t know about the data here that have been used and it is actually kind of fascinating. But it is time intensive. It’s a lot of searching.
MR. DAVIS: Not requiring attribution makes it incredibly challenging to track those down.
DR. MAYS: Let me just ask Josh a question because I know Josh has actually provided Jim with some information on this. Josh, how do you find the nonfederal use of federal data? Like you know about Google –
DR. ROSENTHAL: There are various web analytic things that are usually pay that you can basically go in and hit breadcrumb trails even without authorization. Some of those are free, easiest is Alexa. Part of it is just knowing, just being in the space and knowing, hey, ProPublica has a FEMA datashop where they clean up all the data and make it usable. You buy it for $30. I hit Alexa, I see, oh, there’s 2.2 million users of that. That’s great. Or knowing, even in my day job we work with U.S. News & World Report that’s using HHS data. All right. They had five million viewers of HHS data last month.
And then part of it is looking for these metadata breadcrumb trails that Helga was talking about a little bit. If I see something in a particular case on an XML format that is in, even referenced through something like healthdata.gov, I can pick it up through a scraper. We have custom built scrapers that we use for things.
And so that allows me to kind of see, hey, somebody used an HHS dataset. There’s a core trail there. They wrapped it with their stuff. It got picked up in somebody else’s distribution. That’s sort of how you can figure out that Google is using it, Tableau is using it, et cetera, et cetera.
Part of it is just kind of thinking about it. Instead of thinking about users and consumers of data, who visits the destination site, where might this stuff be in the ecosystems of who is using actually healthcare data, looking at that and finding identifiers. And part of that information we shared that is up on the slides, there’s literally millions of users of data through secondary and tertiary systems which is sort of what at least Todd and Company had talked about when they talked about data liberation, not doing it themselves but putting it out for others to do it.
So you would expect that if it worked you would have massive amounts of users there but they would not be identifiable through kind of pull-based analyses. And Erika and I chatted a little bit whether New York State or what have you, or actually your grad student, most of the systems that talk about traditional users are doing pull-based analysis instead of looking for crumb trails and scrapers and distribution systems where all the users actually are. So you get a distorted view and it’s always much smaller.
Like a few things come out of looking at that. One, you underestimate who is actually using your data. Two, it’s skewed towards people who self-identify in locked-down systems which would typically be researchers over other folks.
DR. MAYS: So, Susan you may want to get the information that Josh sent to Jim to help you with some of that. Let me take Helga’s comments and then what we’re going to do is actually talk about what our action steps are. Helga.
DR. RIPPEN: And again, just trying to build off of kind of your point Vickie as part of the case. So what is an example area of focus. So you can do policy. The problem is we’re kind of at a shift in political hats. The other is to look at something that’s a little bit agnostic and state and nationwide. So, for example, obesity. So you could in theory say if you wanted to get all the data that we had on obesity and actually leverage the PopHealth team in the sense of all of the different sectors were there for the different agencies. And say, hey, do you have anything that might be related to obesity and you would have transportation maybe chiming in. You would have maybe some on the Food and Drug. You might have agriculture. There’s maybe lots of different groups that the different agencies might say, oh, well, we have something that’s related to it. And then you know Robert Wood Johnson and others, because you were even saying if we get a wide reach out and so you get two for one. You get the feedback on hey, is this the best way to characterize kind of the summation or the metatags or however we want to call it. But at the same time you’re getting some interesting information that actually people might be able to use too. And getting feedback at the same time. So it’s just another way to think about kind of what you were highlighting.
DR. MAYS: I think we have several things on the table. And what I would like to do is – I’ll get to be like Bill which is like let’s move it out there and get it done. So what I’d like to do is for us to first think about using the health data matrix, I guess that’s what we’re calling it. And getting the assessment of that pulled together. So I think you’re in Access already, right?
DR. RIPPEN: I’ve already updated the data element too. I already put in the new one, version 3.0.
DR. MAYS: As Bill says I know they are going to make some changes but we don’t have to really wait for those changes. We can go ahead and figure this out. So what I’d like to do is start setting up a series of calls and figuring out what are – and I think we really do want to hear from Rasheda or Jim because they have a sense of which data owners they may want to do this with and Damon, they may want to do this first. Particularly, I know Damon says, it’s easier when people are in the midst of knowing they are going to make a change soon as opposed to those who have just finished. We might get a totally different response. So I think we want to work with them on selecting a sample. I think we want to think about producing that visual that Linda talked about earlier so that we can have a sense of the sample, whether it’s going to be just the surveys or whether it’s going to be other specific kinds of data so that we can get a good –
DR. RIPPEN: – or area of focus. So if you have obesity then it’s whatever methodology or you check the methodology. The big question is can we do a two-for-one on a topic area that might be of interest in the broader perspective.
DR. MAYS: What we could do is decide that and ask them to fill it out about the dataset and then have them do also a specific area. And part of what we also wanted to try and do is get a sense of how long does it take to get them to fill this stuff out, is some parts more difficult than others. Figure out if we take an area, what’s the best area for – we can ask Rasheda, we can ask Susan, like if we pick a topic, let’s pick something that’s going to be very useful for them and that we know will go through no matter who is here next time but that is going to stay as a priority.
And you’re right, obesity, probably the opioid might be a big one as well. I was just going to say that one you might actually get the data owners. We want them to actually want to fill this out as well. And it may be something that they know that the Department is very interested in so that might be actually one to think about because that would be an instance held to everybody. So I think that that may be it.
So I’m going to suggest that we start there and that in terms of where we are with the guidance and the pieces there. Those are going to be some longer conversations that we should have but I think at this point getting this started we can see what contribution it definitely can make in a short amount of time and I think then in terms of other pieces. I would love to be able to find somebody, Kenyon, who could just help us do the blog that we have talked about because that’s when we can start with this real time of picking datasets and having people say this is how I solved this problem.
MR. CROWEY: One thing I was thinking about earlier, we were talking about the audience and wanting people to do this, we might also want to think about so what are the behavioral incentives we can embed into some of these systems that would incentivize people to want to do these things if we’re talking about having social directories or sort of the community to populate these things. What are the mechanisms that might be important to actually motivate or incentivize that? And some of it might be when the external community – but some of it also might be within HHS.
DR. RIPPEN: I think it’s attribution. It was a citation because they needed to know like we heard earlier, who used our data, because we are going to have to ask for money to maintain that database. Well, this agency used it, that agency used it, plus external groups used it. So it’s a question of attribution.
MR. CROWEY: But it’s attribution, who’s using it, why they are using it. If these are all the great things that you’re supposed to be doing, who is doing all these things and are you getting credit for that and is that being visible to their stakeholders and their peers.
DR. MAYS: I think that would actually be very good for us to figure that out in terms of kind of to do one of these –
MR. CROWEY: Well, when you see like Stack Exchange and Quora and all these sites we talked about in the past that like generate a lot of engagement with their community and one of the reasons they do that is they provide sort of these opportunities to incentivize behaviors.
DR. MAYS: I think if we could think about some of these behavioral economic incentives that would be great. For example, it may be as simple as social tagging in terms of the users. They get to tag and then they get to have other people realize that they – I think that that would be worth some planning in terms of thinking about –
MR. CROWEY: 100 percent of your datasets are compliant with data framework or meet the guidance.
DR. MAYS: Something that helps them in some kind of way to be able to justify – this is where we also will talk with Susan at some point because it’s like, Susan, the issue that you all deal with your surveys results kind of dropping. Is there something that we’d be able to do in terms of embedding for the public the value of the survey so that when people get called they are actually like, okay, I’ll participate. We should think about that.
MR. CROWEY: Did you notice they actually have something on data.gov where it says submit your data story? I don’t know if people are actually using that. Under each dataset it has submit your data story. We might want to see what kind of data stories are coming into there or even if they are at all. Because if they are having some success with people submitting those data stories then that might provide some fodder or something we might want to leverage. If it’s an easy thing – every federal dataset might have a widget on the page that says, how do you plan to use this data? Just a little widget, it’s code this long. Just put it on the website, then you can have this long feed. And a lot of people will just put it in if it’s easy.
DR. MAYS: You could also – and some people would be willing to do this. You could ask them to put a little YouTube about how you use this data, that then – and these are the things that will potentially drive people when they get called to participate in the surveys to actually do it because they’ve heard of it, it was a hot thing. They saw a little YouTube video. We’re going to put these things on our agenda but I think the first thing we’ll do is have a planning call to get started on how we’re going to move the matrix out. Anything else before – I want to kind of run just a couple of other things so anything else about what’s next on our agenda. I don’t want to try and do our work plan yet because I want for Jim and I to sit down and really see what his bandwidth and things like that are so rather than sitting here committing it all up, we’re going to do kind of at least this one thing which I know that is quite doable which is to start with that and then we’ll kind of look at some of the other things. And give us a couple weeks. He just kind of started – we just talked about it a week ago so let’s not run him away with too much work too quickly. I’m really looking forward to his participation. He’s been great already.
DR. SUAREZ: I think as you develop that work plan, there could be identified linkages to the subcommittees, privacy, standards –
DR. MAYS: Yes. Yes. Because one of the things that – and she didn’t get a chance to talk about – one of the things we asked Leslie to do was to talk to Maya, to Linda, and to Susan Queen, about what privacy, confidentiality issues are. It’s clear that there is some standard stuff, that’s standard stuff, not standards, that will come up that we want to think about. Pop is always there. So Pop has been there so I think we’re okay.
So I can see what our next directions are rather than making it like the subcommittees, where they are planning out for two years, let us get our first thing done so we can have a sense of accomplishment. You don’t know how many of us really want that. So thank you Josh and Helga. We’ll have our first little – well, we had a letter that we did which was painful, but I think this will be actually more fun in terms of getting it accomplished. And then we’re going to try and sequence to other things and see how all these things can work. Let me give people online –
DR. FRANCIS: I just wanted to jump in to say that I did try to get in touch with Maya and with Linda. I didn’t email Susan because I didn’t have her email ready in hand but I think I do now. But I didn’t get anything back on privacy from either one of them.
DR. MAYS: Oh, then I don’t feel as bad. I thought you were all prepared with all this stuff so that’s even better. So that’s on your to-do list then.
MS. BERNSTEIN: No, it’s on mine. So, if I may, yes, Leslie was very kind to write to me and she happened to – I was wanting to coordinate with my co-chairs, one of whom was in the middle of a move and the other one – so haven’t got it all coordinated yet to have a good answer for you because we just couldn’t pull it together in the last week or two or whatever since Leslie wrote. But it’s not Leslie’s fault. It’s on me and our committee folk. So thank you, Leslie.
DR. MAYS: Okay, so I feel better because we didn’t get to it. I just want to give people online a chance for any other comments, anything in particular that you think is on a near agenda rather than a far agenda. We’re very clear that we’re going to start with a matrix. Any other things? Susan, we will loop back to you because I think that there are a couple of things that we have that sound like they will be useful to you and then we also want to make sure that we get a sense of some of the priorities in terms of NCHS.
DR. QUEEN: Yes. Okay.
DR. MAYS: Great. Thank you for being on. Anyone else, final comments, questions, anything that you want to say. Our next thing that we will be doing is sending out a poll for a conference call. So online, any comments? What I’d like to do before we end is just run the table and see if there are any comments left, anything that you want to say? Let’s hear the gavel and thank you very much.
(Whereupon, the meeting adjourned at 5:00 p.m.)