Reducing Barriers Through Open Data with MIT’s Dr. Alex Shalek

Dr. Alex Shalek is the J. W. Kieckhefer Professor in the Institute for Medical Engineering & Science and the Department of Chemistry at the Massachusetts Institute of Technology (MIT). This article includes the edited transcript of Dr. Shalek's Lab Coats & Life™ Podcast episode with STEMCELL's Director of Brand and Scientific Communications, Dr. Nicole Quinn, and industry co-host Dr. Jason Goldsmith. In the episode, the three discussed ownership, reproducibility, and privacy considerations when sharing data openly.
Sharing data can accelerate discovery, promote collaboration, and improve reproducibility, but it also raises complex questions around consent, privacy, equity, and career development. For life scientists, deciding when and how to share data is no longer just a technical or logistical issue; it's an ethical and strategic decision that can affect everything from grant applications to global research partnerships.
Dr. Alex Shalek and members of his lab strive to engineer technologies and methods that any lab, anywhere in the world, can adopt through open-source experimental protocols and computational packages. In this episode, he speaks candidly about the promises and responsibilities of open data, and why simply "sharing everything" isn't always the right answer. Drawing on his experiences with global research collaborations, Dr. Shalek offers a grounded perspective on what it means to share data responsibly and shares his thoughts on how the open science movement must evolve to support both scientific progress and the people behind it. Whether you're an academic, industry scientist, or early-career researcher, this episode invites you to rethink what "open" should really mean in modern science.
Podcast published May 2024.
The following interview has been edited for clarity and brevity. The views expressed in this interview are those of the individuals involved and do not necessarily reflect the views of STEMCELL Technologies.
The Promise and Challenges of Open Data
Nicole Quinn (NQ): Before we start, I will note that we're specifically not talking about open-access publishing or sharing final polished stories. We're talking about sharing raw data, raw methods, and doing that on openly accessible platforms.
I was fascinated, intrigued, and humbled by Alex Shalek’s insights and his eloquent way of framing some of the opportunities and challenges surrounding open data at the [2023] International Union of Immunological Societies (IUIS) meeting in Cape Town, South Africa, where we discussed open science.
Can you speak to your specific experiences with openly sharing data and how you're currently working within this open data sharing world?
Alex Shalek (AS): One of the things that's so exciting about genomics is the promise that it enables you to explore many different hypotheses. More than thinking about the things that you can do in your own experiment, it's the idea that as you bring together datasets across multiple different experiments, there will be opportunities for reanalysis that will enable emerging insights. We always think that as we get more and more data, there will be real opportunities to take advantage of that to push the boundaries of what we know: to learn new things about how cells work, about how our tissues work, about what drives disease, and to find incredible opportunities for technology development, whether it's understanding what we need experimentally or building new computational approaches. I'd say that the recognition of the importance of data also underscores some of the complexities associated with it, because many people are interested in having access to data, and many people have different levels of access to the data. In many places, when we originally envisioned the studies and obtained consent, particularly if we're thinking about human data, some of the use cases we might want to pursue now weren't envisioned at the time.
It opens up lots of questions about how to best do open access data and how to do the kinds of things that we all can see as being potentially incredibly powerful, while at the same time recognizing that, fundamentally, we owe respect to the individuals who partnered with us in these studies by sharing samples and material at a time when things may have been very precarious or very difficult for them personally. So there's this tension between the promise and the responsibilities that we have as investigators, and the responsibility that we also have, transitively, as people getting involved in reanalyzing those samples, to make sure that we do as well as we can by those individuals and their wishes.
Industry vs. Academia: Different Perspectives on Open Data
What are the different thoughts and considerations around open data in industry?
NQ: Jason, where are you sitting? You're coming at this from an industry perspective, whereas Alex is working as an academic.
Jason Goldsmith (JG): So I'll say the mercenary answer is that industry loves it [open data] if we don't have to provide it. We see this in academic papers all the time: "Hey, I have this new interesting finding, and I went to a dataset of human samples and saw it there." That, I think, is the gold standard example Alex is getting at of new discoveries from old data. You don't have to go run a new trial. If the consent's all clean, there's a lot there. But by the same token, when private industry becomes involved, there are other concerns. Private industry spends millions of dollars of investor income or public stockholder income, i.e., people's money, to generate these datasets. So they have an interest in capturing the benefits of that. They usually share the data at some point after the trial is complete and published, but it could have a decade-plus embargo on it because they want to mine all that information and be the beneficiaries of it. They actually have a responsibility [to do that] because of how the investment structure works.
So it creates this interesting dichotomy where industry broadly loves public data that's available—taxpayer-paid data—and that data is available for everyone to use. Companies are taxpayers too. But then they have an interest in not sharing their private datasets. I think COVID is an example to the contrary. But even then, the federal government backstopped the financial gain mechanism. They said, "Hey, if you can get a vaccine out or two vaccines or three, we'll buy them all. You just get them approved." And so everyone said, "We'll share data with each other to get these all approved." So I think that's part of the complexity from an industry perspective. If you put in a patent or you submit a paper, you put in a copyright [application], anything like that, it's now freely available. Whereas if you keep it secret, it's a trade secret and not available, then that has a strategic advantage to companies.
What are your perspectives from academia, Alex?
AS: I would say that we're looking at it very similarly. It's just that we have different groups of people to whom we're responsible. With some of the things that you do in terms of setting up your cohorts, thinking about consents, and working through it, there are different things that you're committing to those communities as part of that process. Academics often don't have the same ability to translate insights into products that can impact people in the way in which you want. As much as I'd love to be the person that develops a drug, brings it to market, and delivers it to an individual to really alleviate suffering, in many places, that's not what the academic model is made to do. It's made to find those targets, to find maybe a lead compound, and then to move it out to industry. I'd say that we have different responsibilities to people and, therefore, we can think about data in different ways.
I'd say that we [academia and industry] have different responsibilities to people and, therefore, we can think about data in different ways.
Dr. Alex Shalek
But I would say that there's also an importance in having open data. This came up in the meeting that we had at IUIS, which is that in many places, when we think about doing a study or repeating a study or setting it up again, A) it's not a very efficient use of money and B) it's not a very efficient use of time. There are tons of hidden variables in the way each of us thinks about structuring our studies and building stuff out, where there can be biases or confounders of which we're not totally aware. This shows up in a lot of places; for example, differences in the gut microbiome across mouse colonies can lead to different results. But I think that sometimes it's nice to have completely different studies structured by different people with different datasets, where you can see glimpses of hypotheses, or the ideas that give you that sanity check and tell you you're on the right path and can move forward. So I think that there's an independent value to open data, which is that it gives us the opportunity to "sniff test" or to really fact-check some of the things that come out of what we're doing.
The Complex Ethics and Economics of Open Data Sharing
What makes open data sharing so ethically and economically complex?
JG: I think the difficulty is in aligning those incentives properly. So this absolutely happens where in industry groups, two big companies are working on the same pathway and some of their people decide they should share data with each other and enter into an agreement or a CDA [confidential disclosure agreement] and then share that because they see benefit in all those things you just said. But a little company, like the one I work for, for example, may not want to share that because if they share it with the big company, the big company is going to outspend them and essentially jump eight steps. What the little company wants to do is sell them that data, so to speak. So that's where it gets very tough because all the benefits are there. But if the company spent $10 million of investor money to create the data, they want to see that $10 million back. Otherwise, it doesn't exist anymore.
AS: I'd want to see more than $10 million back. As the son of an IP [intellectual property] attorney, I will sit here and say I totally understand. The biggest question is, what do you cover [with patents] versus what do you keep as a trade secret? What do you keep internal? And where do you really get the most bang for your buck out of all the things that you put in? Because that's critical. And we [academia and industry] get to speak from different positions and approach things in different ways. I think we both see the utility in open data. I think that we just have to approach our engagement with it in different ways. So to your point, I might think about being a more proactive generator of open data and open sharing of data. But the thing that you're highlighting is your responsibility to many of the people in your company or who fund companies. I would say that when I generate open data, I also have to think about who I partner with because those individuals might have different desires and wishes. [For example,] when I partner with individuals in the global south that have less access to some of the computational infrastructure, or don't have the same resources and know-how to move quickly.
Is immediate open access always fair, or should we consider equity in timing and access to data?
AS: Or do we provide protected periods over which they can engage with the data? How do we make sure the researchers working on that actually generate the greatest return on it? And how do we make sure that that return accrues back to the community? As you think through these things, open data is great. But at the same time, we have to think about who owns that data, and that's the people involved in providing the samples that generate the data, but also the individuals involved in generating that data. What are the responsible things to do with it? We could take the approach that says, well, we took the samples and brought them to Boston, where I'm sitting right now, ran them, and generated data, and the data benefit accrues to the communities that donated the samples or that collaborated with us as partners by contributing them. But that might be a little disingenuous if we're not really engaging those communities, thinking about researchers in those communities, and thinking about what local desires and wants are.
It's not fair for me to assume that every single person can be as open with their data and share it as freely, even when we're all working towards something that we agree is a major goal, like figuring out how to fight COVID-19 or tuberculosis or HIV.
Dr. Alex Shalek
So I completely understand what you're saying. We just have a different group of stakeholders. The incentives to open science aren't the same among all those groups. As much as I appreciate the strong push to share, particularly from government organizations that I'm affiliated with, and as much as I believe in sharing, I recognize that I sit in a position of privilege and that I'm able to do that. I have to be careful sometimes not to put some of my students or my trainees at risk, because they have careers and they have things that they need. But I can share much more easily and much more freely than many people around the world. It's not fair for me to assume that every single person can be as open with their data and share it as freely, even when we're all working towards something that we agree is a major goal, like figuring out how to fight COVID-19, or tuberculosis, or HIV. So what's really important in understanding open data is understanding who's involved in the data in terms of the donation, the generation of the data, the analysis of the data, and the responsibilities that go with that entire process.
I'd even say it goes beyond academia versus industry. It involves really thinking comprehensively about who's implicated, what their desires and wishes are, and whether this is a responsible way to engage.
What’s the difference between reproducibility and scientific responsibility, and why does that distinction matter?
AS: We have to separate out the reproducibility piece from the responsibility piece because I see them as very different things. I can think about how to create reproducibility and what's required to create reproducibility and how open data can contribute to that. I think about that as being different from some of the other pieces that we're talking about. In many places, issues with reproducibility come down to the fact that we very often don't methodically or systematically characterize a lot of the things that go into experiments. We don't necessarily pay attention to them. They're the hidden variables. So they're things that we just take for granted. But we don't realize that other investigators don't take the same things for granted or some of the things that we think of as being unimportant turn out to be important when you look in different contexts. I think that a lot of figuring out the reproducibility piece is finding a better language, a better way of communicating. One of the big things that people talk a lot about when they talk about open data is this idea of creating better standards, better metadata, better references, better ways of annotating things to enable greater consistency and sharing of information.
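To make the idea of standardized metadata concrete, here is a minimal sketch, in Python, of the kind of structured, machine-readable sample record being described. The field names and values are illustrative assumptions for this article, not an established community schema:

```python
# A minimal, hypothetical sample metadata record. Field names are
# illustrative only; real deposits would follow a community standard.
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class SampleMetadata:
    sample_id: str
    tissue: str                          # e.g., "PBMC" or "lung"
    collection_time_utc: str             # ISO 8601 timestamp
    time_to_freeze_hours: float          # collection to freezing
    storage_temp_c: float                # e.g., -80.0
    media: str                           # e.g., "serum-free" vs. "10% FBS"
    protocol_ref: Optional[str] = None   # link/DOI for the open protocol

record = SampleMetadata(
    sample_id="S001",
    tissue="PBMC",
    collection_time_utc="2024-05-01T09:30:00Z",
    time_to_freeze_hours=2.5,
    storage_temp_c=-80.0,
    media="serum-free",
    protocol_ref="doi:10.0000/example-protocol",  # hypothetical reference
)

# Serializing to JSON lets the record travel alongside the raw data.
print(json.dumps(asdict(record), indent=2))
```

Capturing even a handful of such fields consistently would let downstream analysts ask whether a disagreement between two datasets tracks a hidden variable like media composition or time to freezing.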
Why Full Transparency Is So Hard: Publishing Limits and Regulatory Lessons
Are publishing constraints getting in the way of transparent and reproducible science?
AS: I think that one of the things that I really dislike about publishing is how you have to pack all of your methods and approaches into the smallest place possible with the fewest words possible because you're trying to get [your research] into a print format. I love that there are some journals that try to focus on methodological approaches, where you can talk about some of these pieces more freely. But I think we need to do a better job of describing what we've done, how we've done it, why we've done it, and what the assumptions were behind it so that people can have those conversations. If you're not creating a format where people can easily look at that and add to it, particularly as they're amalgamating the information from lots of different studies, you miss a tremendous opportunity to find that things aren't reproducible for reasons that are totally explainable. So you could say, "Oh, this isn't reproducible because these people did this in serum-free media whereas these people did this in media with serum." And you would say, "No, it's entirely biologically predictable. The results actually reproduce biological features. They just don't reproduce the same results because they were done in different ways, in ways that make sense."
How can open data standards improve the reproducibility and sharing of scientific methods?
AS: As for reproducibility, I think open data can be important in creating standardization of descriptions of experimental approaches, of metadata, of ways of people going back and forth, of normalizing the idea of having conversations. Obviously, there's this point around trade secrets that Jason brought up, which comes up in some contexts even within academic labs, but you really need greater exchange.
As for reproducibility, I think open data can be important in creating standardization of descriptions of experimental approaches, of metadata, of ways of people going back and forth, of normalizing the idea of having conversations.
Dr. Alex Shalek
JG: I think reproducibility is the area that industry cares the most about, in a couple of ways. One example that we all know about is clinical testing. If I get a blood count at one hospital or another, or even in another country, it's often run on medical devices that go through really rigorous reproducibility [testing]. Obviously, every experiment by a scientist can't be that way. But regulators set pretty strict guidelines about how you have to describe your assay and at what level of detail, whether that's for one of those clinical devices that becomes a medical device or to measure how much acetaminophen (paracetamol) is in your pill. That is a rigorously reproducible measurement. Industry will take that same technology for ibuprofen and for [the statin drug] Lipitor and anything else.
Maybe that's a lesson learned from regulators on how to describe assays precisely and what standards are needed for robust reproducibility, or at least the descriptions therein. You're not going to run 100 samples in academia, but you can describe it better than we [industry] do. Is it serum-free media or is it with serum? Some things aren't written down, like, "Oh, you need to angle the thing at 45 degrees." But in pharma, that's all really regulated, from the temperature of the pipette to the angle of it to what brand. And that's how they get around that [reproducibility issue], because the FDA wants it to work every single time you measure the acetaminophen in your pill. There could be some lessons from there that could help academia and publication in general in terms of conveying that information, even if you don't have to do it to the same level.
How can we satisfy both scientific and regulatory standards?
AS: We're talking about things that could be similar or could be different. The idea of mass is consistent. We have a scale that we use to measure mass. We can be very reproducible in how we approach it. Temperature is something where we have a scale; I was teaching a class earlier today using temperature scales. It's something where we can be very consistent. I think the problem is that when you think about biology and the kinds of big data we're talking about in genomics, you have lots of things that you cannot control. We engage with different groups of individuals as donors and partners based upon what study we're doing. Those people aren't identical. They don't have identical experiences. They don't have things that would enable you to reestablish the same system. Even if you think about it with mice, or pick your favorite model system, it's hard to create something that is exactly identical in every instance so that you can get the same output over and over again in exactly the same way. So what becomes important is to understand what drives variability and to really use that to think about whether or not what you're seeing is consistent.
So I think what you really want to say when you say reproducible is, is the biological mechanism consistent? As opposed to, did we get exactly the same results? The entire way I got into single-cell genomics was because we took cells that were supposed to be identical. We hit them with exactly the same stimulus, and they all responded differently. We tried to take a system that was supposed to be as identical as was humanly possible, [using] postmitotic cells synchronized in a way, hit with a bacterial Armageddon, and they responded differently. Now, those differences in response gave us something to correlate against, which helped us figure out mechanisms that we could then go back and validate. But I think the point is that biology has some variability in it. As a physicist, I'm not going to sit here and tell you it's the quantumness of biology, but there is some variability and we want to understand whether or not that variability is a biological feature or a technical feature. I would separate out the reproducibility piece into technical reproducibility and biological reproducibility and 100% agree with you on the technical piece.
What’s stopping scientists from being more meticulous in documenting key variables in methods sections?
JG: How often in the methods do you see [details about] when samples were collected from donors? How many hours until it was frozen? They'll have some range: "Samples were frozen within 85 hours." And you have to ask, "Well, what does that mean? Is it most of them within 24 hours, or 70?"
AS: It's so important. I know from some of the rapid autopsy work we were involved in during the COVID-19 pandemic. The amount of time from sample collection to processing, or from when the individual unfortunately passed away from COVID to when the sample was collected, the temperatures that it was kept at: all these things played a critical role in driving things. They're understandable. You can see why they do specific things and why you get specific results, but only in light of having that information. In many places, we aren't careful enough in writing it down. I think we have to create standards around that being an expectation because it leads to much better interpretability. I want us to think about reproducibility versus interpretability. Consistent interpretation, I think, is what we want to go for, particularly as we move to biology [biological reproducibility], as opposed to technical reproducibility, which is a must in order to get to a place where we have biological consistency.
Consistent interpretation, I think, is what we want to go for, particularly as we move to [biological reproducibility], as opposed to technical reproducibility, which is a must in order to get to a place where we have biological consistency.
Dr. Alex Shalek
JG: That's where I think regulators have done a little bit. They’ve said, "Hey, if you're going to submit this, you have to tell us the time you stored it at and the average and the mean." Because of the stakes on, for example, drug potency and manufacturing of sterile things, they have created a playbook of important variables. I think it's not going to translate 100%, but you could learn from it to an extent with what regulators have demanded of pharmaceutical assays just in terms of a playbook of “write it down, please.”
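As an illustration of the point about ranges like "frozen within 85 hours," here is a minimal Python sketch of what reporting the distribution of a hidden variable, rather than a single catch-all bound, might look like. The numbers are invented for the example:

```python
# Hours from sample collection to freezing; values are made up.
import statistics

time_to_freeze_hours = [2.0, 3.5, 4.0, 6.0, 8.0, 24.0, 70.0, 85.0]

summary = {
    "n_samples": len(time_to_freeze_hours),
    "median_hours": statistics.median(time_to_freeze_hours),
    "mean_hours": round(statistics.mean(time_to_freeze_hours), 1),
    "max_hours": max(time_to_freeze_hours),
    "n_within_24h": sum(t <= 24 for t in time_to_freeze_hours),
}
print(summary)
# Reported this way, a reader can see that most samples were frozen
# within a day and that "within 85 hours" reflects a few outliers.
```

A methods section that deposits this distribution, rather than only the worst-case bound, lets readers judge whether storage delays could explain a discrepancy between studies.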
Setting Standards for Open Data
Who in academia is overseeing or running the open databases, and how are they being standardized and communicated?
AS: There is an incentive in industry for repeatability. You want to do no harm to the person who's going to take that acetaminophen or the aspirin. There are lots of repercussions to that not happening. So not only is it the right thing to do, but you wouldn't want to have something that was adverse, for all the reasons in the universe. When you think about academia, there's a lot of incentive to do it within your own lab, to produce results that are very consistent, and to think about how to do things in a way that provides validation. We like to use multiple different methods, multiple different approaches, multiple different systems; as many pieces of information as we can possibly get. But now, as you think about open data, where you're pulling together resources across labs, that's where I think it is really hard, because the question is, who oversees that? Who convinces you that you should be adhering to specific standards? Or gets you to intellectually buy into the value of doing that? Particularly when funding decreases or stays the same and inflation goes up, it becomes really hard to think about doing extra work and putting extra onus on people.
That’s the rub. A lot of it comes down to people who are big proponents and advocates of the possibilities of big data or people who want to do work that requires big data to push these things forward. It's a lot of convincing and getting people to the table and telling them they have to do this. It's hard to say this is what will incentivize people to adopt a common standard versus adopt best practices because I think everybody wants best practices. They want to do good science. They want to make sure that what they're putting out into the universe is high quality and it's not going to come back to them that something was wrong or they need to retract a paper because nobody ever wants that.
Open Data and Protecting Trainees
How can we balance trainee, PI, and community interests?
AS: It's critical. I think that this idea of sharing and when to share is really important because, as you say, I sit in a position of privilege. I'm a tenured professor. Every single paper is important to me, but it is not the piece that makes my career, that leads to my PhD, that lines me up for a postdoc, or might line me up for a job. I get to think about it slightly differently. In general, I want to get back to something that came up a little bit earlier. Why do we do these things and what motivates us? Personally, I want to help people. I do this job because I want to help people. That means I want to find causes for illnesses. I push toward creating better therapies. I want to mentor and train people. That's why I'm in academia, right? And so I have to think about who I'm trying to do good for and who I'm responsible for.
How can we balance trainee and PI interests?
AS: As we think about open data, early sharing can potentially put some of my trainees at a disadvantage because they don't have the time to do their analysis, write up their paper, and work through things. I think that what becomes critical in sharing early and sharing openly is to set expectations, to have conversations, and to put people in a place where they're comfortable with that and where you basically have created agreements that protect individuals. During COVID-19, we had all these partnerships. We worked with people all around the globe, and we wanted to show, for multiple different species, multiple different tissues, and multiple different contexts, what cells were most likely to express ACE2 and TMPRSS2, which are likely entry factors for SARS-CoV-2, the virus that causes COVID-19. I had to have a series of conversations with the editors saying I can't share all of this data because these people haven't published. But what I can do is tell you all the cell types, and I can give you the expression of these two genes or this set of genes, all the potential entry factors plus ACE2, and tell you what the cell types are, and maybe give you a marker or two.
I think that what becomes critical in sharing early and sharing openly is to set expectations, to have conversations, to put people in a place where they're comfortable with that and where you basically have created agreements that protect individuals.
Dr. Alex Shalek
But there are times where we had to think about what we were going to give and what we were going to hold back. In other places, we can think about sharing data early, but we can have agreements. We can set things up in the same way that companies would set up NDAs [non-disclosure agreements], where we agree that we'll share this early, but you'll wait for us to put our paper up on bioRxiv and then cite it. But what I've learned is that if we're very careful and thoughtful and we have the hard conversations early, it puts us in a better position where we can take care of the people to whom we're beholden. It's not a bad thing for me to say, "Look, I trust you, but I need to say this to you because I represent this trainee. And this trainee has put a lot of trust and a lot of faith in me. And it is my job to respect that trainee and to have this conversation so we're on the same page ahead of time." I've just learned to be very proactive in those conversations and to recognize that every time I set up a collaboration or a partnership, I need to understand what the other PI wants, what the other student wants, and what my student wants.
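For readers curious what that compromise can look like in practice, here is a minimal, hypothetical Python sketch of sharing per-cell-type summaries for a few genes of interest instead of the full cell-level matrix. The values and cell type labels are toy data for illustration, not figures from the study:

```python
# Toy cell-by-gene table; a real analysis would start from a much
# larger single-cell counts matrix with annotated cell types.
import pandas as pd

cells = pd.DataFrame({
    "cell_type": ["AT2", "AT2", "ciliated", "ciliated", "goblet"],
    "ACE2":      [1.2,   0.8,   0.1,        0.0,        0.3],
    "TMPRSS2":   [2.1,   1.9,   1.5,        1.2,        0.4],
})

grouped = cells.groupby("cell_type")
# Mean expression and fraction of expressing cells per cell type:
# a derived summary that can be shared before the raw data is public.
summary = pd.DataFrame({
    "ACE2_mean": grouped["ACE2"].mean(),
    "ACE2_frac_expr": grouped["ACE2"].apply(lambda g: (g > 0).mean()),
    "TMPRSS2_mean": grouped["TMPRSS2"].mean(),
    "TMPRSS2_frac_expr": grouped["TMPRSS2"].apply(lambda g: (g > 0).mean()),
})
print(summary)
```

Because only cell-type-level aggregates leave the lab, collaborators' unpublished cell-level data stays protected while the community still gets the biologically useful signal.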
Balancing Data Benefit Across Communities
How does open data help broader communities?
AS: There are places where some things work for me, but they don't necessarily work for everybody involved. [In those instances,] you just have to have the maturity to say, "This is a place where we have to step away." So this gets back to the larger conversation about open data more broadly. There are a lot of people that push for open data, but when we think about communities, and about those that are involved in generating and using data, we have to take the time to think about data benefit. We need to really think about who the data benefits and how. I like to think a lot about the communities that are impacted by whatever we're studying. That's not just the people within science. It could be the trainees, the PIs [or graduate students] that need those papers to get grants or to graduate, but also the individuals that might be impacted by a specific disease, and what they hope and what they want. At the same time, you have to balance who in our lab can do it, what they need, and what stage of their career they're at.
I think being careful and having very direct conversations about it becomes important. I'm not going to say that I was always as thoughtful about this, but I've learned along the way to be very proactive and to be very careful to make sure that I understand what people want and to constantly check in to make sure that I really do understand and it hasn't evolved in a way that's going to become problematic.
How can we help trainees and early-career scientists develop the soft skills and frameworks needed to navigate this human side of science?
AS: Podcasts like this are a great way of starting those conversations. When I talk to my trainees, I always say things like, "The things that got me to this position are not the things that positioned me for success in this position." It was what I was able to accomplish with others in science and because of some of the ideas I had. But now I run a little company, [and] I have to think about my trainees, not as employees, because really I work for them. I have to think about the people with whom I work and what they need. So there are things that I wish I had done earlier, like leadership training. I push my students, and whenever there's an opportunity, I try to get them involved. There are other pieces of the puzzle. I think that making that part of the graduate curriculum [is vital]: creating all those soft skills, things that become critical regardless of whether you're going into academic science or industrial science or leaving science completely to do something different.
I think we have to make that more part of the practice because I went in naive with the best intentions. But the more I learned, [the more] I recognized there were things that I wasn't paying attention to, or things that I wasn't fully thinking through, or situations that I couldn't anticipate. The real goal in all these things is to make sure that you have the impact you want, and to take the time to think about whether your intentions and your impact align. Finding why there's a discrepancy and working towards minimizing it is critical. But some of these larger skills, we have to build them into the education of people, because they become so important to people's success and to these paths that are the more human side of the business of science.
Bridging the Gap: From Bench Skills to Leadership Skills
How can we better prepare scientists for the collaborative, people-focused side of research?
JG: I'd say that what makes you a good grad student, or a good postdoc, or a good early PI is not necessarily what makes you successful long term, whether in academia or industry. Your critical thinking skills and your ability to break down problems are there. But there's this big step between individual contributor and leading a group of people that occurs everywhere. I think it's a problem in science generally. It happens in industry and academia. Science is not particularly good at building that natural "on-ramp" to go from an individual contributor to a leader of a team, whether that's postdoc to junior faculty to senior faculty, or scientist I to scientist II to group leader in industry. Understanding the collaborations you set up and how to negotiate with people is one of those group-leader skills that's a lot more helpful if you get experience early.
I was lucky in my postdoc, as I got to set up collaborations and deal with data sharing, and [ask the questions], "Okay, how are we going to publish this? And who's doing what?" So when it came to industry, I knew what was a trade secret and what wasn't. I had a template for how to go about it. But I also didn't have training on how to handle trade secret negotiations with academic labs, and what we can share and what we can't. I had to go talk to legal for a little bit and figure it out. I think the larger point that Alex makes about that learning curve and soft skills is something that all of science would benefit from: stronger [soft skill] development for undergrads, grad students, postdocs, early-career industry scientists, and early-career academics. Science is such an intellectual, individual-contributor pursuit early on; then you have a giant phase shift to the later phases, where the natural on-ramp that sometimes exists for other career paths [isn't there].
I think the larger point that Alex makes about that learning curve and the soft skills is something that all of science would benefit from: stronger [soft skill] development for undergrads, grad students, postdocs, early-career industry scientists, and early-career academics.
Dr. Jason Goldsmith
AS: The thing to recognize is that different labs, working on different problems or at different sizes, have different interactions and connections to industry and to partners locally, across the state, across the country, and across the globe. It creates different opportunities. You'd like to see some greater normalization and sharing of those learnings. People are going to have different opportunities to practice it, and obviously, people have different goals, so you don't want to assume that everybody is going to want [to learn] every piece [about career growth], but you do want to have exposure. You were asking me at the beginning why I have a Bachelor of Arts instead of a Bachelor of Science. It's because I went to a college to get a broad arts education, because I wanted broader exposure to things. There are different pieces that I have different familiarity with and competency in, just because of how much I've used them. But you want to at least make sure that people have some of those tools, because once you have them, particularly some frameworks for picking up some of these larger problems, it becomes much easier to work your way through stuff.
Privacy and Consent in Open Data
What are some of the frameworks in place, and some of the gaps, around data privacy?
AS: There's a lot going on, and I think it really depends on who you want to include in this. Are we talking about open data with respect to sharing things within [for example,] Boston, within the United States, or around the globe? Are we thinking about who is getting involved as a collaborator and a donor of samples? Even more than that, think about the communication of consent: what do people really understand about consent? Is the language interpretable in a way that they can really understand how the data will be used and reused? There's a lot to unpack there and a lot to work through. In many places, it's easy to ignore it and to move forward, because you could say the hard part is the science. But I think it's more important in a lot of ways to spend the time in the beginning thinking through these things and setting up structures that last.
Getting back to the mentorship piece, I often tell people it's more important that you learn how to do science than that you accomplish specific things in science. A PhD is really learning how to approach the process and how to show up in the world to make sure that you are [being] the person and doing the work that you want to do, more than guaranteeing any result, because those things are hard. It's a big thing to wrap your head around. When we try to think about it through one specific lens, it becomes problematic, because there are differences that are readily apparent when you have conversations with people in different parts of the world. Instead of trying to solve things with generalities, we really need to lean in and think about how to partner with people. Creating structures that work for them becomes the most important part of all of this. You might say, "Well, it slows down the big part of open data. It doesn't let us do the mining on everything." I would say, getting back to this idea of technical reproducibility versus biological interpretability, there are a lot of places where understanding how people did things, understanding the lens they apply to science, and carrying that through into interpretation and analysis can be incredibly empowering. I think that building those partnerships builds the scientific community. It helps to advance people. It helps to identify opportunities when you pull datasets together.
I think that open data is not easy. I think the process of working towards open data is almost more important in some ways than achieving open data because the process of doing it and building the community and doing things right, getting people engaged, thinking about process, thinking about how to really support and enable people and empower them—to me, [that] is the fulfillment of what we're really trying to do, as opposed to the thing that happens in the future.
How does this work for anonymized personal information, such as for clinical trial data?
JG: I'm in charge of the privacy for our donor network and our company. I think privacy and consent get very interesting. You made a point early on that sometimes there are new things we want to do with data that weren't even envisioned when the consent on the trial was obtained years ago. Look at the historical nature of consent: frankly, it used to be terrible. We [used to say], "Oh, sign here. Don't explain it. You're good. We can do whatever we want forever." So now we have a different reaction, where you have to describe in detail every single thing you could possibly do now or in the future with the data and have an affirmative approval for that. If you didn't list a type of test that didn't exist yet, you can never use the data for that in the future. You get this really weird thing where you try to write broadly, but it could be so broad that it's meaningless. But then that gets thrown out by IRBs [Institutional Review Boards, which work to ensure the ethical and safe conduct of clinical trials involving human participants] as not being specific enough. And then, [if you're too specific,] you could preclude the data from being open.
I wonder if part of the work may be going back to standards, but [working toward] a more unified understanding of what true informed consent and privacy protection are. But put privacy a bit aside and start with anonymizing the data. Obviously, there are some ways to back-calculate and break that anonymity. But, assuming the data is anonymized, [the goal is] to get consent that allows the data to be used, knowing that five years from now someone's going to want to use that genetic sequence for something that no one envisioned could exist now, without completely losing access to all the data we have to date because it wasn't listed the right way on the informed consent at the time. That's like a landmark event that needs to be aligned on.
AS: It's one of those things where, if you talk to people in other parts of the world, they would say the fact that you're even sharing genomic sequences [is the problem]; that is something that you can't fully anonymize. Obviously, there are places where we could think about recontacting people. You could have consent to recontact, right? It becomes complicated. Sometimes an individual, unfortunately, passes away. Then you have to think about whether you have the ability to recontact the family. You also have to think about these larger questions of consent of the individual versus the community. In many of the conversations we've had with indigenous communities, one individual may consent, but the way in which the information may be used may have an impact on the broader community. You start to ask, "How do you do right not just by that individual but by the broader community that they represent?" That's where really thinking about consent becomes critical. But I would argue that a lot of this comes back to the point that you made at the end about creating greater awareness and understanding of what consent is and what people are consenting to.
Final Words of Advice
Any final advice for scientists about to start sharing open data?
AS: It comes back to being involved in greater partnership with these communities, having conversations, going back and forth, relaying what the science is teaching you, talking to them about what you're doing so that you have a common base for dialogue. You could say it slows things down because it's impossible to do all of that, but it becomes important to achieve the overall goal of what we're talking about here.
You can find Dr. Alex Shalek on X.
To get show notes, episode summaries, and links to useful information, or to learn more about STEM mentorship, see www.stemcell.com/labcoatsandlife.
You can also sign up for our email list, reach out to us on X, or email us at info@labcoatsandlife.com.
Have guest suggestions? Let us know!