Protecting data from the public and ourselves

This post is by Lizzie.

Many years ago I received an email from some ecologists I had never met who wanted to use data from my PhD (remember my PhD field work that I told you about the other day?). Oh glorious day! Me and my shrubs were out in the sunlight of other researchers’ interests. And, better than that, all my data was already public so when they asked for the data I just pointed them to a link and wished them well. After some back and forth they asked me to read the resulting paper to make sure they had not misused or misunderstood the data from my PhD. I thought this unnecessary but read it anyway. I found nothing to complain about.

All I recall is that I didn’t love that they called the field site I worked at ‘a moonscape’ so I suggested they change it to ‘shrubscape.’ This was partly because I think Mojave Desert when I think moonscape and I worked in a coastal habitat with more vegetation (see photo), but also because I was on a total kick of trying to get more folks using shrub as an adjective (I was thrilled when I published about ‘shruboreal arthropods’). This brings me to point 1 I wanted to make about data sharing today.

The idea that we should all contact the folks who produced data to make sure we use it correctly is misguided.

It’s a nice idea, and I don’t want anyone to think I don’t enjoy meeting the people who produced data that I use (or really that anyone uses; I watch movies to see the CERN people in their hard hats producing data and I would enjoy talking with those folks too) or that I don’t realize a lot of things can make data less straightforward (when I was a student we called these ‘demonic intrusions’). It’s that the idea leads to small (and more often less amazing) science. I think this is especially true in ecology, where we spent decades sharing stories with each other of what we each found in our particular system. (Side shout out to Ryan Thum, who will forever live on in my memory for telling me during the first year of my PhD: if you truly want to advance ecology, then you should get two wheels, one labeled with different systems and one labeled with different questions. When each PhD student shows up they spin each wheel and that’s what they do their PhD on.)

Think about how big your science can be if everyone who wants to use data produced by someone else is required to contact them. I definitely am not analyzing tree rings as a topic! (But shout out also to all the researchers who have replied to our queries about their public data, including showing us photos of their western redcedar tree cores when we were incredulous that they could have 2 cm of growth in one year.) No more global analyses of advancing plant leafout for the IPCC! There are the obvious logistical problems of the effort to do this (including all the dead people who produced data that is now public), but also the issue of what we would ask them exactly. As a researcher who does large-scale analyses using other people’s data, it’s on me to make sure I validate the data, understand the data and read all the metadata. Is this a way to skip those steps and just get an emailed approval check-mark? Or is there so often missing information, which I would somehow not find in the process of doing science (visualizations, analyses, etc.) and which the researcher who shared the data would have forgotten to mention, that would derail the whole finding? I doubt it and that’s because of:

1) The person most likely to benefit from you posting your data with metadata is you. Future you. Because in a year, or two or ten, you will not remember all the details of how it was collected and which day someone turned off the automatic watering on your experiment and dried out your plants.

So what are the odds you will be so good at helping others use your data in the future?

2) I don’t know of many cases where people screw it up. And by that I mean: I don’t know of any cases. If you know of a good case, tell me. And if you’re someone who is being told to worry about this, ask for the good examples of this problem. There must be a name, beyond the boogeyman, for this tendency by those opposed to data sharing to toss out risks of data sharing that they themselves have never observed. It’s just out there in the ether, being so scary, like a camp story someone told them, that someone else told them about two campers … who were waiting for a friend … at night … in a dark parking lot…. Boo!

My other point (point 2) sort of relates to this boogeyman problem: the belief that data should be protected from ‘the public.’ This was a new one; until recently I did not know so many people were into it, and so I am not (yet?) sure how to combat it (I am open to ideas). It seems pretty obvious to me that data generally thrives in the sunlight.

Open data is a major tenet of the open government movement that many democracies have signed onto (and, must I say this, open data is not a major policy of autocracies), based on the reality that open data makes discriminatory policies obvious (e.g., redlining). There are real cases where public government data has been misused to harm minority groups, but most often the reverse is true. Closed data makes it easier for governments to unequally distribute resources and hide other nefarious actions. So I am nervous about how many people seem quick to invoke minority rights and concerns to back up government policies to keep data closed, with zero evidence to support their supposition (is there a term for this? There should be). Shouldn’t our first reaction be that data should be open, absent good evidence to the contrary?
