Nikesh Gosalia and Mark Hahnel continue their discussion on making scientific data public, talking about the idea of publishing negative data and the unequal explosion of gold vs. green open access. Mark makes a strong case for keeping all data openly available in time-sensitive research fields such as COVID-19. He also discusses how open access could propel research to the next level: if all research data is available online, AI technologies could sift through it and detect trends that humans can’t. Keeping with the theme of technology, Mark touches on the role of blockchain in decentralized sciences and Figshare’s focus on building policy-compliant infrastructure to make resilient, long-lasting technology. He also shares his thoughts on collaborative efforts between humans and AI, citing the Google DeepMind project AlphaFold as an example of a successful project. Mark talks about some open science initiatives that have caught his attention, with a special focus on Cassyni, a company that uses technology to improve the impact of lectures. To end this episode, Mark shares his opinion on the factors that are preventing the mass adoption of open science.
Mark Hahnel is the Founder and CEO of Figshare, the all-in-one repository for papers, FAIR data, and nontraditional research outputs. He is passionate about open science and its potential to bring positive change to the research community. Mark has acted as an advisor for the Springer Nature master classes and is currently on the advisory board for the Directory of Open Access Journals (DOAJ). He can be reached on Twitter or LinkedIn.
Insights Xchange is a fortnightly podcast brought to you by Cactus Communications (CACTUS).
This was the point where my mum finally understood what I did. We need to have common metrics so that in future, we can give researchers credit for all of their open outputs. The academic publishing setup and reward setup is very complicated. There's lots of things that are confusing about it.
Thank you for joining us again. This is a continuation of the previous episode. As always, here's your host, Nikesh Gosalia.
Why do you feel opening the scientific data to the world is so important, especially what we maybe term as negative data?
Looking back on the last 10 years, I think there are two areas where there hasn't been as much progress as I'd like, even where there has been fantastic progress elsewhere. If you see the growth of green versus gold open access, 10 years ago, there was about the same amount of green open access, making a free copy of your paper available somewhere, versus gold open access, paying a $1200 or $12,000 fee. There's definitely been an explosion of gold open access. I think there's two-and-a-half times as much gold open access as green open access now. That's frustrating, because it's not perfect in the equitable future world. But I'd say, hey, I have access to a lot more of the content now which I wouldn't have otherwise, because I am not affiliated with the university. There are some compromises. I think the idea of negative results getting published is a huge inefficiency in research. That's one area that I don't think has come as far as it could do. That's because of, I think, the data credit system. The credit for all of your research outputs works to a level, but at the funder level, it's not quite there yet. It's really encouraging to see that it is happening. The National Institutes of Health are funding repositories to start interoperating with each other and promoting things around their new policy. And so, I am the co-chair of one of their working groups, which is the open metrics for data publishing group.
What's great about that is they are saying, we need to have common metrics so that in future, we can give researchers credit for all of their open outputs, because it's very hard to normalize everything. Previously, we had one type of content, the paper, and the paper will always remain the king or the queen. It's the context, it's the story of what you did. And then, we have one or two measures of impact: the citation count and, potentially, where it's published, which I'd like to move away from, but such is life. You can't pretend that things don't happen, right? I love DORA and things like this. I love what Randy Schekman did when he won the Nobel Prize, saying, let's not all just focus on publishing in Nature. But at the same time, pretending it's not a thing just makes life harder for everybody working in this space. And so, negative data is something that I think there is going to be more credit for in the future. But the reasons why it's important, right? I touched on COVID, right? This was the point where my mum finally understood what I did. As if it wasn't hard enough with stem cell biology back in the day, she just used to tell her friends that I worked on the internet, because open academic data publishing is a bit niche if you are not in the space. With the publication of datasets associated with COVID, I think the worst term in any paper is "data available upon request." There was recently a study that showed that if you actually chase those people up, I think it was 7% that got back to you. I like the idea of emailing. It's easy to do now: finding everyone who's got "data available upon request" in their papers, emailing them and asking for their data. If they say no or don't get back to you, then you can just go to the publisher and say, hey, can you retract that paper, they are not doing what they said they would. But that person would probably have a death wish, knowing the current academic battle that's going on.
But with COVID, it was easy. We are trying to get to a vaccine. If people make their data openly available, so that people can build on the research that's gone before, we can get there quicker. If this time axis gets longer, then more people die. The way we can make this time axis shorter is to have all the content openly available, all the papers openly available, all the data openly available. The researchers don't need to send an email to somebody saying, hey, can I have access to your data. Because all it takes to slow that down doesn't even need to be an active dissent. It's just passive. They just don't reply to the email. Then, that time axis gets longer, more people die. If all of the research is openly available, then we can build on top of it and move further faster. Then there's the idea that COVID was a one-off and that, well, people understand that because it was an emergency. We have the Sustainable Development Goals. I think every one of them is an emergency, right? You've got climate, you've got poverty, you've got people who are going to be living under water levels. All of those things should be an emergency that we are focusing on, with the research community making all of their research data openly available. That's point one. It was a very long point one. I have a point two as well, which is this idea of the machines doing all the work for us. It's something that came about 10 years ago. There was a book published called The Fourth Paradigm in 2009, from Microsoft Academic Research, led by the late Dr. Jim Gray, and he had this idea that if all of the papers and all of the data are online, then we can get to this next level of research. That's the computers doing all the grunt work for us. Machine learning, AI. Human brain processing power is never going to be able to look at every single paper, every single dataset to look for trends and patterns.
But if you can get the robots to do this for you, then that combination of robot and human together can push us onto the next level of knowledge discovery. I think that's the coolest thing about what the next 10 years of academia looks like.
Fantastic! This is really just thought provoking. What I have personally noticed over the last maybe three years, perhaps, is that the whole narrative around use of technology is definitely kind of changing. People are more open to listening to ideas. People understand that technology is here to augment what we do and not really replace, like you correctly mentioned. But just from your perspective, Mark, I mean, what have you been hearing? What have you been experiencing as far as, you know, use of AI and machine learning and perhaps, I mean I don't know, blockchain is concerned? Then, just very specific to Figshare, perhaps, I mean what kind of technology is involved in delivering Figshare products? What kind of development has been happening?
Yes, it's interesting as well, right, because you have the different stages of company growth. My developers will hate me saying this, but when I talk about what we do now, we build policy-compliant infrastructure, and the cool stuff, the open research, the open access, the cool stuff that can build on top of it, is secondary to the fact that we have to play the game with the organizations that exist. If we are working with universities in South Africa, the data needs to live in South Africa. Regardless, from the point at which it's published, it's available for everybody anywhere to download. Fine, but the government has a rule that the data needs to live in South Africa for very good reasons over the past. We need to make sure that that's happening. This is true of Australia too, and in Europe, certain universities can't use American-owned companies. As our default cloud layer, we use AWS, Amazon Web Services.
You can't use that for some European universities, so we use the German version called Hetzner. We have a multi-site version of Figshare built all over the world which is cloud based, with rolling updates. We are adding new functionality all the time. It's always with that ISO-certified compliance. This is going to be around a long time. This is resilient. There was a story about a famous researcher whose thesis got made available on a university platform. The headline was, it was so exciting, it crashed the platform. I understand that as a news story, but I would be devastated. It's like, well, there's more people looking to take an interest in something that's really niche and really interesting, which I think is a good thing for humanity to be looking at. The fact that lots of people couldn't have access because technology fell over is, for me, not good enough. And so, it seems boring on that level. But it's very important in kind of feeding this next level of stuff, of open content for things to be built on. In terms of some of the things you mentioned there, for me, the most interesting role for blockchain in DeSci, as it's called, decentralized sciences, is all about this access control. I think we'll see a lot of innovation in that space: PhD certificates, Master's certificates minted on blockchain, so you can carry them around with you, a lot more trust, a lot more authenticity in the space. I do think that in terms of machine learning and AI, very close to where we live is this idea that the human and the machine are better than the human alone or the machine alone. We see this with chess, right? The best players of chess in the world are the human with the machine, which will beat the machine, which will beat the human. That's why we have seen it in the news recently, about whether people are cheating with machines in the chess world. I think that's true about academia. It's been amazing!
The biggest revelation for me recently was the Google DeepMind project called AlphaFold. This is the AI company that originally got famous for beating Go, which is some very complicated board game that I have never played before, so I couldn't possibly tell you how it's won. But then they took on protein structure folding and this idea that you can predict a protein structure based on a text string. It was just an unsolvable problem. In 50 years of research, very expensive crystallography research, we'd got to something like 17% of the proteins involved in the human body. And overnight, not "overnight," but overnight, AlphaFold came out and said they'd got to 99% of the structures. And it blew my mind. I remember thinking at the time, this is going to win the Nobel Prize. This is a real step change in the way that, as I mentioned, Jim Gray and The Fourth Paradigm were thinking about before. In the year since then, they had one million structures. They have now got 200 million structures. It's really just insane what's happened in that space. I actually emailed the lead author on the paper, a guy called John Jumper, to say, we are working in the data space, you have got some very specific protein data. We really care about improving metadata because we have such massive heterogeneous data. We want to make sure the metadata is better. You mentioned FAIR data at the beginning, the F of FAIR, findable data. Because a lot of researchers are new to the space and just call their dataset "dataset." How do we get the metadata good enough so that AI will one day eat all of the machines, or eat all of the data, and spit out new things? I was asking what specific schemas to work on. They were just like, we are working on everything. We need to move the space forward. It also raises a slight problem. I don't know if it's a problem. I think it's a conversation to be had. I saw a graph recently from a report called the State of AI.
We do The State of Open Data every year, so I was reading the State of AI. They said that in 1960, all AI and machine learning algorithm innovation was coming from academia. Now, it's basically zero. It's a rounding error compared to the innovation that's happening in the private sector and, working in the private sector myself, I can't be calling people out. But I think it is something to think about, given how important it is going to be in the next 50 years of humanity.
Just aside from the work that you have been doing, Mark, and things that you have been following, do you know of any other open science initiatives that excite you?
In terms of open research, I think about it as this collaborative landscape and how that's moved along in the last decade and what have you. We were talking offline just beforehand as well about a sister company of ours at Digital Science, Overleaf, and how explosive their growth has been in just better collaboration tools for researchers. I know it's not all open. But I think that open way of working just really helps set people's ideas from the beginning that we are openly collaborating on this. People openly collaborate on code using GitHub and GitLab and Bitbucket. I think those tools are really important. One thing I have seen in the last few years as well, again a COVID reference, is this idea that everything's moving to more online workflows. It's been a great movement in terms of innovation because things like this are much more accessible. Now, everybody knows how to use webcams and all of this. What I have seen that's really interesting recently is a company called Cassyni, which is a company that looks at lecture series. Again, it's another kind of uncaptured output of research. We have this thing at Figshare, where people make available their keynote slides where they had 100 people in the audience, and they get 10,000 downloads. It's exploding the impact of your research and who gets to see that.
I think lecture series, and presentations in general, are ripe for innovation. What Cassyni has done very well is what you were saying before, making use of technology to improve the space, not replacing anyone's jobs, just doing innovative stuff. They have this engine for throwing in a video and some presentations, and it pulls out all of the references. It just academizes it, in the same way that Figshare will make sure it's persistent and make sure it's got a DOI. It's a few little tweaks that YouTube is never going to do, which is why you shouldn't put your videos on YouTube. GitHub's never going to put DOIs on every item that they have because that's not their core model. Like the tweaks that we have done, I think Cassyni is doing some really interesting things in terms of making content more accessible to more people. I think just having lectures available in a machine-readable way means that knowledge discovery becomes a lot more interesting.
Why do you think some institutions are slower to adopt an open science model? What's preventing, say, mass adoption, in your opinion, Mark?
I mentioned before about different countries having different rules. I also mentioned this report we just brought out called The State of Open Data. We have done it for the last seven years. It's a survey of researchers' thoughts and trends around what people do around open data and open science. The good thing about it is, they are kind of being hit from many different directions. Every Springer Nature journal has an open data policy now. If you are going to get published in Nature, and they say, hey, you need to make your dataset available, you will make your dataset available to get published in Nature. If your university has a policy, if your funder has a policy, these things all help. But not every country can move at the same rate, and not every institution has the support available. I think another big problem is this equity problem of ivory tower universities.
We work with universities. They get their data repository powered by Figshare. They have 10 people in the university working on it. Five of them are checking the files. Five of them are promoting and explaining to researchers how you use it and what licensing is. We have some universities who have a data repository because there's a national mandate, but they have one person who can work 20 minutes on it every week, helping people, answering support tickets and things like this. It's really that balance of where this funding is and where the support is. I think there's two things with why people would be negative towards it, or why people haven't caught up with it. We see China has an open access and open data policy. North America does as well. Fifty-two funders around the world have an open data policy that says you have to make your data available. Nearly every private funder of research has this as well. Some countries are just slower to move on certain things because of other priorities. Some universities are too, because there's a lot to do around that library. It could be working on engaging with researchers around ORCID or other priorities. I think that the future is already here. It's just not evenly distributed, that kind of mentality. This is what I was saying about Overleaf: working collaboratively and openly changes the space in just moving things along. We don't have to wait for people to die for the space to move along, which was the old kind of mentality of things, because the academic publishing setup and reward setup is very complicated. There's lots of things that are confusing about it. As I said, when I was a researcher, I had no idea what open access or closed access meant, right? I just wanted to publish papers, because that's what I was told to do in order to advance my career. And so, I think there's a lot of people who are still on that mentality of, yes, open access and open science and open research is cool.
If you ask me about it, I'll tell you, yes, that's good for humanity. Let's do that. Do you do it though? Well, no, I haven't got the time, or I have got a mortgage to pay, so I need to publish in the best journals. The only way I can do that is to publish closed access, da, da, da, da. I think there's still the legacy of old academic dissemination of content that is a hindrance to the mentality shift and the culture shift for academics around the world. Sometimes it's naivety and sometimes it's almost maliciousness.
Absolutely.
That's all the time we have today. Catch more of this episode in the next part with your host, Nikesh Gosalia. See you soon.