Gareth O'Neill and Nikesh Gosalia continue their discussion on all things open science, starting with the case of Diederik Stapel, a cautionary tale on the dangers of closed data. Gareth shares his insider knowledge on upcoming open science developments, including FAIR, European Open Science Cloud (EOSC), and the use of AI algorithms to find dataset correlations, as well as his observations on how stakeholders and researchers are responding to these developments. He also talks about solutions for current issues that are affecting datasets such as data stewards and a revamped reward system for researchers. Focusing on the future, Gareth discusses why open science will take decades to be widely accepted. He addresses the possibility of a tipping point for its adoption and talks about future considerations as AI technologies improve and data singularity occurs.
Gareth O'Neill is the Principal Consultant on Open Science at the Technopolis Group and a doctoral candidate for theoretical linguistics at Leiden University. As the Former President of the European Council of Doctoral Candidates and Junior Researchers, Gareth is a renowned expert on open science for the Dutch Government and the European Commission. Reach him on Twitter.
Insights Xchange is a fortnightly podcast brought to you by Cactus Communications (CACTUS).
Thank you for joining us again. This is a continuation of the previous episode. As always, here's your host, Nikesh Gosalia.
And you mentioned briefly that you will probably come back to FAIR. So, could you just explain that a little bit?
To be honest, this is my own opinion, I think we are fighting over open access publications, which I think is ridiculous. We're fighting over getting access to reading the outputs of research, like that should be a dumb conversation. It's a no-brainer that this should be open. We figure out the business model as we go. And we do that together with key players, such as the stakeholders who we hope will work together with us, right. But the future is data, as in the expression ‘data is the new oil.’ We have been moving steadily towards bigger and bigger datasets. And this is where this concept of FAIR comes in. So, ultimately, what you want is data open for access to researchers, right? Now, minimally, if I have a research article, I want the dataset that links to that article open to me, so I can check it. Otherwise, how can I check what they've written? This has proved a huge problem in many fields, not just psychology. There has been a crisis in psychology recently, right, but it's in many fields. And an example here, I'll pick the Netherlands again, because I live there, is a psychologist called Diederik Stapel, who was caught committing fraud at such a high level, it shocked the entire community. This person had been essentially making absolutely fraudulent, spurious claims, which had been peer reviewed, accepted, and published in top journals. I think the number of retractions when this was found out was something like 55 articles pulled from a whole string of high-level journals, right. And this came to light not because somebody did their research on what the data said; it came to light because several of his Ph.D. students figured he was being fraudulent and pointed the finger at him.
And unfortunately, several of them suffered career-wise. They lost their Ph.D.s because they were based on fraudulent data. So, the community let this fraud and fraudulent activity take place. And when they looked deeper to see how these articles got published, the answers were always the same. He's a top figure. Why would he lie, right? Oh, we didn't get the data. We assumed the data was correct, or we asked for the data, but we never got it. And in exceptional cases where data was given, it was clear the data had been doctored: it was perfect, like perfect data which you never get, right. And yet 55 articles got through peer review. So, the point here for open science, and the point I am getting at is, we should be opening the data. Had the data been open, or the peer reviews even been open, we could have quickly checked to see whether this was true. Somebody would have checked at some point. And once you see one red flag, you will go looking for other red flags, but there was not even the possibility of a flag. You just essentially had a comment, a claim that was never backed up and yet supported by the peer review community, right. So, purely for reasons of openness, being able to check if what we're doing is correct, and to learn from what we're doing in data, we should be opening datasets, right. Now, this is where I need to draw a distinction between open data and FAIR data. So, open data is essentially any dataset that I open. It doesn't necessarily need to be in a specific format. And it doesn't need to be explained with metadata, right. So, metadata is descriptions of what that dataset is, or how it's structured, or what's in there. So, I can open a dataset that's completely, and the word we'll use here is ‘re-useless.’ It's open, I can locate it, but it's very hard for me to know what's in there and what I can do with it, if I can do anything with it. Now what we want to move towards is what's called FAIR data.
And this is data that is essentially machine actionable. Meaning, a machine can look at a dataset, it knows what's in there, it knows the structure, and then it can tell me what I can do with it, or it can tell another machine what we can do with it. So, these letters stand for: F, ‘findable.’ So, you have enough metadata in there that a machine, an index, can find that dataset so that when I type into Google or any data search, it can see it, right. If it's not findable, you type into Google ‘linguistic research,’ you won't find it. It's somewhere on a server, but there's no metadata for a machine to find it, right. A, it's ‘accessible.’ Meaning I know the level of access that I can get. So, I could have a FAIR dataset that's completely closed, I can't access it as a human, but I at least know I can't access it. I know the reasons why I can't access it. And in many cases I can request access, or there can be multiple levels of access where it could be accessible to a community or a field, or it can just be completely, fully publicly accessible, right? In any case, the machine knows how it can get it. I, ‘interoperable,’ and this is crucial for the future, means that the data links to other datasets. It's not an isolated dataset on some server that I can access and leave. I can actually somehow combine these datasets in linguistics and get a machine to somehow look across them and pull out, let's say, correlations or pull out information on this combined dataset. And the R, ‘reusable,’ basically tells you what you can do with it in the sense of licensing, copyright, and how I can actually exploit that dataset, right. Now, the point here with FAIR data is very simple. Once it's made FAIR, it's out there, a machine can use it and find it so that I as a researcher can get access to the data. Again, as I said, it doesn't have to be open, it can be closed. But there are ways to access a closed dataset through algorithms.
So, think of biomedical data which should never be open, at least much of the privacy-sensitive data. You can still send algorithms into this closed dataset. They can look at the data, extract results, and I can go on and continue my research without ever having access, right. The future here, however, really is this I, ‘interoperability.’ Now, what we're working on at the moment in Europe, and that's something I'm working on myself, is an entity called the ‘European Open Science Cloud,’ EOSC. And what this essentially is, is a web of FAIR data and services to exploit the data. And essentially, you can consider this as a new internet of data. So, imagine the internet on a simple level is basically, first of all, the agreement on how we share information; without that we can't communicate. So, it's an agreement, that's the Internet Protocol and other protocols, and then connected or hyperlinked text, right. So, the internet as a whole is all of these texts that we have out there. And if you will, the idea of the EOSC is to have a similar internet, but not of text or all the stuff you see on the internet, but of actual connected, linked data. What that means is that for this data, there is a standard in place, that's the FAIR protocol, if you will, that's how we will describe the metadata, so we can connect all the datasets. And the real idea here is that I can go to a search bar, if you will, not necessarily Google, I can look for a dataset and I can find a dataset in any discipline, or on any topic that's out there. That's the first step. The next step is I can then start using services like machine learning or, in the future, artificial intelligence algorithms to start correlating and comparing data, either because I direct it, say, I need to know about a comparison of linguistic data with biological data, why do we make certain sounds, and it goes and looks. Or I put in open searches.
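
As a rough illustration of the four FAIR facets and the search-bar idea described above, here is a minimal Python sketch. It is not the EOSC or any real FAIR metadata schema: the `DatasetRecord` fields, the access levels, the identifiers, and both helper functions are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str                      # F: persistent ID an index can look up
    title: str
    keywords: list[str]                  # F: descriptive terms a search can match
    access_level: str                    # A: "public", "on_request" or "closed"
    linked_ids: list[str] = field(default_factory=list)  # I: related datasets
    license: str = "unspecified"         # R: what reuse is permitted


def search(index: list[DatasetRecord], term: str) -> list[DatasetRecord]:
    """Return every record whose title or keywords mention the term."""
    needle = term.lower()
    return [
        record for record in index
        if needle in record.title.lower()
        or any(needle in keyword.lower() for keyword in record.keywords)
    ]


def access_note(record: DatasetRecord) -> str:
    """Say how the data can be reached; a closed dataset is still findable."""
    notes = {
        "public": "download directly",
        "on_request": "request access from the data holder",
        "closed": "no direct access; only approved algorithms may run on it",
    }
    return notes.get(record.access_level, "access level not described")


# Two made-up records: one open linguistic dataset, one closed biomedical one.
index = [
    DatasetRecord("ds-001", "Irish Gaelic color terms", ["linguistics", "color"],
                  "public", linked_ids=["ds-002"], license="CC-BY-4.0"),
    DatasetRecord("ds-002", "Patient records", ["biomedical", "malaria"], "closed"),
]

for hit in search(index, "malaria"):
    print(hit.identifier, "->", access_note(hit))
```

The point of the sketch is that the closed record still turns up in the search and states why it cannot be downloaded, which is the sense in which a dataset can be FAIR without being open.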
So, let me give you a future vision of what this connected data could look like, right. I go to a search bar, I type in malaria. This is a really simple example. And I am picking an example from one of the architects of the FAIR principles, Barend Mons at the Leiden University Medical Center. He works on such infectious diseases. So, I take malaria. I type malaria into my connected dataset. The algorithms are running in the background and they have the entire datasets now to look across, right. So, it takes malaria. It sees that there are 100,000 publications linked to malaria because it's been tagged. And it sees in these 100,000 articles an assertion; an assertion is basically a linguistic expression of fact, right. And it sees in 95% of the cases the assertion ‘malaria is transmitted by mosquitoes.’ It sees in 5% ‘malaria is transmitted by chocolate.’ Statistically, we rule out chocolate. There could be some correlation we don't know, but we get rid of it. So, we now have a tag, malaria and mosquitoes, right. We then go to a public database. So, we step outside of academia. We go to a public database to see where in the last 10 years malaria has been registered, where has it happened? And we start pulling out statistics. It happens in this latitude, in this longitude, in this time-frame, in this season, okay. Then, the machine goes to the Copernicus satellite data for the last 10 years. And it starts looking at that latitude and longitude in the time-frames we marked, and in the time-frames we didn't: what was the temperature? What was the rainfall? What was the humidity? What was the water level? What was the geological situation at the time, right? And it starts building a picture, as you can see, of what's going on with this one word, malaria. We know mosquitoes are in play. Now we know when it happens, where it happens, and we start to get a picture of why it happens. We go to a public database of where chemicals are being sold to treat malaria.
We can see a high percentage of certain chemicals or medicines were sold, and others weren't. We can see responses on what was working, what wasn't. We can go into the chemical database and see what were the active components in those medicines or compounds that we were using, with the net result that this machine has built a picture of the entire malaria ecosystem. We have a good idea what's causing it, we have a good idea where it actually happens, we have a good idea of the factors that support this happening, we have a good idea of chemicals that were working in the past. And it may ultimately predict what compounds we could be using, or ways we could actually treat or work against malaria. It's done that in possibly a fraction of a second, or less than a minute. That research would have taken researchers, even 100 researchers, maybe 100 years, if not more. So, what we are really looking at here is a new way to interact with data that we can never interact with as humans anymore. There's already so much data out there, it's impossible for us to look through and find the correlations. I mean in one dataset, never mind combined datasets in the field and really datasets across fields through interdisciplinary research. So, this I think is the future of data. It's not just about open science in the sense of opening it up and being able to read it as a human, right, or read a publication or look at a dataset. It's really about deploying a web of data, if you will, with tools to help look across combined datasets for correlations we never would have found in our lifetimes, for breakthrough science. And this is not machines replacing humans. This is a tool for me as a human to then actually go and do research and pick those topics. So, I think, yeah, this is the future for me in terms of open science.
Thank you, Gareth. That was really fascinating and very informative to know about FAIR data.
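
The first step of the malaria example, tallying assertions across tagged articles and discarding the rare one, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the function name `dominant_assertions`, the 50% cut-off, and the sample of 100 strings standing in for 100,000 articles are all invented here; the speaker only says the rare assertion is ruled out statistically.

```python
from collections import Counter


def dominant_assertions(assertions, threshold=0.5):
    """Return {assertion: share} for assertions at or above the threshold share."""
    counts = Counter(assertions)
    total = sum(counts.values())
    return {
        claim: count / total
        for claim, count in counts.items()
        if count / total >= threshold
    }


# Toy stand-in for assertions mined from the tagged malaria literature.
mined = (["malaria is transmitted by mosquitoes"] * 95
         + ["malaria is transmitted by chocolate"] * 5)

print(dominant_assertions(mined))
```

A real pipeline would need a far more careful statistical test than a fixed share, since, as noted in the example itself, a rare assertion could still point to a correlation nobody knows about yet.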
One question that immediately popped into my mind is, how do you see the other stakeholders taking to this? What's their response been to it, is everyone excited about it, including publishers, academic societies?
I'm excited about it. I don't see the same level of enthusiasm from many colleagues, however. And I think the reason for that is, like I said, researchers are very focused on their research. That's what they do. That's why they do it. And they don't often look at the bigger picture of what happens to my data afterwards and why am I doing this, right, why am I making my data open in any case? But it's typically framed, from my perspective and that of the colleagues I know, in terms of open science: I open my article and they can come and read my article, and I open my data and they can reference it or use my data, right. But it stops there, on an individual level. And what I am getting at here with a web of data is not an individual level, it's combining data, right, combining that research. And I don't just mean a sub-field of linguistics, or the field of linguistics, but really, crucially, all fields. So, it's a big, big picture that I don't think most researchers think of or are aware of, except those for instance working in interdisciplinary science, right, where they have to, by nature, interact with different datasets, and then they immediately hit all the issues that we have with making data interoperable and accessible. How do I combine data from completely different fields in different formats with different metadata, different structures, different results, different communities, different cultures? And maybe researchers working in interdisciplinary science see the bigger picture in that sense. Now, the issue at the moment with data, and it's a complex one with many aspects, is who makes it FAIR? So, who makes it FAIR, and who defines what FAIR is? And who checks what FAIR is, right?
So, let me take the who makes it FAIR. A researcher ultimately will need to make the dataset FAIR, because they are the ones that know what it's about. We have the concept of what's called a data steward, from Delft University, again in the Netherlands, the Technical University of Delft, by the former Rector Magnificus, Karel Luyben, who now is the president of the EOSC Association. So, you see the connection here. And Karel has been doing great work for many years on open science but specifically on opening up data and connecting data into a web of FAIR data and EOSC. We have this concept of data steward. And a data steward is basically a professional, probably somebody who used to do research or still does research, who supports a researcher in making their data FAIR. Now, what that practically means is they help the researchers understand what FAIR is, right. It's a set of protocols, and they help them implement that in their dataset. I did this many years ago for a linguistic dataset at the Max Planck Institute for Psycholinguistics in Nijmegen in the Netherlands. And I was supported by two data stewards who needed to have our data in a specific format, so they could compare it with another 55 datasets from different languages across Europe. So, I was working on data coming from Ireland, Irish Gaelic at the time, and we wanted to be able to compare this data across these 55 to 60 languages from Europe. To do that, we had to be absolutely brutal in the format. It had to be completely the same, so the machines could look across it, right, and we could do real research. But it's not just the format. It's how you get to the information you put in the box, in the cell, right. So, I was working on collecting, for instance, color terms in Gaelic. And this essentially involved me asking young Gaelic speakers in the west of Ireland, you know, to name, what's this color? And they'd say, in English I'll say, red, yellow, blue, right.
The point is that that's not a given across languages. We have different concepts of where we draw the lines, or if there are even lines at all, right. In some languages that's one color; in another that's maybe two; in another that's maybe 10. So, I was asking them about color terms. And funnily enough, men are quite simple in colors, you get red, green, blue; women, that's blood red, custard yellow, and sky blue, right. So, it's more descriptive for some reason, and I don't know why that is. So, the point is I have to fill in a color, what color do I pick? Do I pick blue? Do I pick sky blue? Do I pick ocean blue? And I have to determine if that's just one form of blue or if it really is a different color. And the answer is in the word itself. I would code all of them as blue. However, the data steward will check to see, did I understand this correctly? Do I have a different word somewhere? So, they actually in a way almost peer review my data, just purely from the way I present it in the logical structure, right. So, they helped me, first of all, structure it properly. They give me the template. They check that what I am putting in there is actually what should be in there. In many cases, it wasn't. You do the research yourself, you are so lost in the research you don't see the objective picture. And then they help add metadata. So, this is a dataset of Irish Gaelic. This is the language code to be able to find this as a machine. This is a dataset on color terms, on body parts, on spatial relations. And it goes into detail on what's in there. So, when it's finished, that dataset is fully ready for me in the future to do more research on it, but crucially to connect to these other 55 datasets. And in the future, to connect to billions of datasets to see is there something we can connect here. And maybe in the future they realize that color terms are connected to eye physiology, to weather patterns, to climate, who knows.
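
The coding and metadata steps described above can be pictured with a small Python sketch. It is only an illustration under assumptions: the template vocabulary `BASIC_TERMS`, the head-word rule, the `needs_review` flag, and the dictionary layout are invented for this example and are not the Max Planck project's actual template, and the responses are shown in English as in the conversation. The code "gle" is the ISO 639-3 code for Irish; whether the project used that code is not stated.

```python
BASIC_TERMS = {"red", "yellow", "blue", "green"}  # hypothetical template vocabulary


def code_response(response):
    """Code a response by its head word, e.g. 'sky blue' -> 'blue'.

    Returns (code, needs_review). Unknown head words are flagged so a
    data steward can check whether they are a genuinely different color.
    """
    words = response.lower().split()
    head = words[-1] if words else ""
    if head in BASIC_TERMS:
        return head, False
    return None, True


def build_dataset(responses):
    """Wrap the coded rows with descriptive metadata a machine can search on."""
    rows = []
    for response in responses:
        code, needs_review = code_response(response)
        rows.append({"response": response, "code": code, "needs_review": needs_review})
    return {
        "metadata": {"language": "gle", "topic": "color terms"},  # assumed labels
        "rows": rows,
    }


dataset = build_dataset(["blue", "sky blue", "ocean blue", "turquoise"])
for row in dataset["rows"]:
    print(row)
```

Keeping the raw response next to the code means the decision to treat "sky blue" as "blue" stays visible and can be revisited later, which matters for the question of which versions of a dataset to publish.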
I'll never know for now, but in a connected dataset space these correlations may come out to tell us not just that we have those differences, but why we have those differences, right, be they cultural or be they physiological. So, the data steward is the person that needs to help me do this. Now, why do I say that? It has to be a professional, as you can see. You can't just get a random person to help me with my data. And I can't do it because I'm not trained in this, and I'm still not trained in it. So, the idea here is that data stewards will help you make your data FAIR. That's one thing. But that takes time. It took me 3 years with my colleague. And as a consequence of making the data FAIR we never published an article, we were too busy with the data. So, I am, what's the opposite of rewarded, punished, because I made the data FAIR and open. You can access it, now a machine can use it, and nobody cares in academia, right. So, the first thing I'll say is, I'll put on my CV, I published a FAIR dataset. They are like, great, where's the article? Well, I didn't get there yet. Yeah, but then it's useless, right, because they don't see a dataset as primary. So, I think we need to change our view on how we see what we do, right? So, we need to start seeing datasets as actual real outputs of research, not just the article telling you about it. They should go hand in hand. And they should be on the same level, because the data is primary, not the article, right. If anything goes wrong with the article, you need to check with the data, and everything flows from the data. Then there are questions of what do I publish? Like what parts of my dataset? Do you want the whole dataset even though the article was only on color terms? Should I open up the body parts? Is this a finger? Is this... what's this called? What's this called? What's this called? Or do I open up only that piece that's related to the publication?
Do I open up the final clean version of that data or should I give you all the versions that we had to go through to get there? Because maybe you want to see how I got there. Maybe you want to see that actually some people call that custard yellow, right, which can probably spark a whole research field in itself. So, what amount of the data do I open? And who looks after that data? I won't. Where does it go? Is it the university, and for how long? Is it 1 year, 5 years, and what happens in 5 years when the structure of the net changes or there are service changes, or technology changes, who transfers it to the new format, who checks that it's still there, and who guarantees that they can pay for that for 25, if not 50 years? My data at Max Planck is secured for 50 years from the moment of publication by contract. Even if Max Planck goes bankrupt, they have a trust fund in place to support my data for up to, I think, 25 to 50 years, right. So, very much thought through ahead on long-term data preservation. And then the crucial part of this is, who pays for all of this? Because if I have to make my data FAIR and publish five articles, I can't. So, an example would be, instead of publishing five papers for a Ph.D., which is typically the norm in Europe, or a monograph, you publish one paper, and maybe you publish the dataset with that paper, and that's it. And this is not my idea, by the way, this is Karel Luyben's idea, the President of the EOSC Association, which I fully agree with. So, you reduce the focus on quantity, and you focus on the quality. You publish a quality article that you've worked on. The joke in the Netherlands was that articles are like a sausage: you could have one article, or you could cut it into lots of mini-articles, which probably don't even deserve to be an article, but you can still get them published.
So rather, you have one high-quality article, and then the matching dataset that's made FAIR and open, that's it, and you get rewarded for that. And independently, you can be rewarded for the dataset. But really, you should be able to be rewarded for both, and they are connected, right. Here's the dataset, where's the article that flows from that, if it's there? Or here's an article, it must refer back to a dataset, right. And then of course this goes back into the reward system of how you reward your researchers to do it. So, it's not just a matter of saying, we'll support you with data stewards, they'll help you do it; you have to give them the time to actually do it, or they get stressed, or they are not able to publish or they are not able to do other activities, and then you have to reward them when they go looking for a job or they go looking for a research grant. So, this is also the movement towards that change, right? So, if I apply for a top research grant, usually it'll say please name your top five highest articles, meaning Impact Factor, right? Or list all of your articles, and they are only looking at the Nature ones or the really high ones. What you want to start seeing is, do you have an open dataset? Yes, and that should be scored positively. So, if I have five open datasets, and that person has one high-level article, I should be getting the job. I've clearly done more work, and I am opening more for the community. So, yeah, as you can see, it's a whole interconnected system of researchers having to do the work, be given the time to do the work and the support. They need the actual physical, mental support to do this, right, in terms of making the data FAIR, and the physical infrastructure to open that up and support that in the long term. And then they need the reward system in place so that they do it in the first place. And it makes sense for them when they go for a job application or a grant from a public body.
So, this is a whole suite of challenges. I think this is the reason why open science is slow. And I think this is why it will continue to be slow for 10, if not 25 years. We have to figure out all these steps. We have to restructure the way we do science and the academic structure itself, until we eventually get to where that's the status quo, where we just go back to talking about science, in the sense then that science is by nature open. And let me jump in and say one more thing here because I want to be clear. When I say open, the statement that we use in Europe, especially for the European Commission, is ‘as open as possible, as closed as necessary.’ And that's a really fuzzy phrase. And that's intentional, because we don't always mean a dichotomy, open versus closed, right. There's a spectrum of openness in terms of how much of the, let's say, data you open: none, all, which versions, as I said, and a spectrum of when. You can have a dataset that's FAIR and closed. And for instance, 3 years is I think an acceptable amount of time to work on your dataset, publish any articles, and then it should be open to the community to look at your data and do work on it, and maybe find things you didn't. So, it's taking time. It will take time. I don't think that's an issue. But what I do think is an issue is that we have to set up all of these mechanisms to facilitate it.
Well, this is really fascinating. And again, just so many questions. But just picking up on a couple of comments, Gareth. You mentioned, of course, because there are so many components to all of this it will take time. But it's interesting that you mentioned maybe it'll take about 10 years where it will still remain slow. But maybe after that things will accelerate and in 25 years we might look at things to be very different. Is there a tipping point that you have in mind where we might kind of start seeing that change?
I think this will be modularized.
It will incrementally go within specific practices. Open access is more mature; FAIR data is at the beginning. I mean, it's at the beginning. Open data has been happening, but it's not structured. Open source is quite advanced and will continue, right. So, we have different milestones moving ahead, which we bundle under the umbrella of open science. But back to the open point, like I said, not everything needs to be open, or should be open. Intelligence data, military data, dual-use technology data, right, research that can be used for both public and private, possibly sinister, motives should not be open, security data, confidential data, and so forth. This should never be open. It does have to be stored somewhere but it should never be open and certainly not connected in a web of data, right. So, that's one thing aside. And secondly, not all data needs to be open for it to be machine actionable. Like I said, an algorithm can go in and do research and pull out results, and I don't need to get access. So, just to be clear there on, let's say, how I see open. Now I think the tipping point has already come for open access, and at least in my view, that was the setting up of cOAlition S and this Plan S to achieve open access. I think this was the funding community taking a stand and saying we're sick of the status quo, we are spending €10 billion per year on scientific publishing, we are not getting our value for money, and we don't have control over what we're paying for. We are not a client in our own system. We are paying, and we can't set our standards. So, this was in my view a step by the funding bodies and governments linked to those funding bodies, taking a stand, saying we're paying, and this is what we want for our money, and this is the value we want for this service. So, I think open access in Europe, and it's extending because you can see the bodies joining cOAlition S are starting to, let's say, spiral around the globe, I think we're moving towards that.
Even those bodies that haven't joined cOAlition S, because they don't believe in it or they can't, are implementing their own open access policies. So, I think the tipping point hit several years ago now for open access. And for me it's clear, the path is open. The question is how, and the level of policy granularity we go into, or what we want there in terms of who retains copyright, the type of licenses we want in place, how we actually give access to articles. But I think it's more complicated for the data and some of these other practices in open science. Like I said, the real challenge and the real advantage is in the data, how we make that FAIR, how we connect it. And then of course this is all useless without the right tools, yeah. So, you mentioned machine learning, artificial intelligence. I don't know how intelligent our algorithms are at the moment. We program them to do what we want. We give them biased datasets to learn from and work on. I don't think they are intelligent at all. There's no consciousness or sentience in that sense. So, the rise of artificial intelligence in the context I am talking about, I don't see that happening in the near future. Maybe it will, but I don't see it happening. So, there's no Terminator worry here from my side. So, maybe we talk about machine learning, which is basically using machines to help us find correlations in datasets. I think that's challenging. You need datasets there for the machines to learn on, and then you can let them run wild and pull out correlations, right. And this is where we come back to, like, robots taking over the world. The machines are not... the algorithms are not taking over anything. They are just simply going to be able to crunch huge amounts of data that I will never ever be able to do in 1000 lifetimes. And not just crunch it and tell me what's there, but connect it and see, is there a correlation?
Now the number of correlations that come out of this entire dataset will probably be astronomical, and most of them will be completely useless if not spurious. The point is that somewhere in there, in that jungle of data, there will be correlations that could help us find cures for malaria or could help us find cures for Ebola or for COVID. Or could help us find new ways to develop nuclear fission or fusion technologies, or new breakthroughs to advance into space and achieve near-lightspeed capabilities, or at least communication capabilities or quantum communication capabilities, right, things that humans may never get to in 100 or 1000 years. So, I think the real benefit of this web of data, with services on top to exploit that data, and specifically machine learning services to exploit that data, will be a profound gamechanger, which I think will be a type of data singularity. I don't mean a consciousness singularity, I mean a tipping point where everything is different, where we have advanced. Like you look at how we've advanced scientifically over the last few thousand years, right. And we go up in increments: thousands of years, then hundreds, then decades. And I mean, I grew up in the 80s. I've seen the record player, to the boombox, to the tape player, to the Walkman, to a disc system that shuffled, to a minidisc, to a USB stick, and now to pure digital technology, high levels of quality in the space of what, 20 years, 40 years. So, I think with this data, we are going to be hitting breakthroughs, once that singularity hits, at a rapid speed, and I mean in terms of years or months. And of course, then the next step will be how much money do we put into this to develop it, who develops it. And I think a crucial issue in the future will be dual-use technology and how we manage that.
Imagine you have a connected dataset out there, and we assume there's no intelligence data there, military data, or very sensitive data, right, such as nuclear plans. But what happens with connected data of billions of datasets and a very clever machine learning algorithm that starts correlating information that I've targeted, right. So maybe the plans for a nuclear reactor are not out there, let's say, as a whole, but in a combined dataset maybe they are out there, spread across like grains of sand on a beach, right. And the machine learning algorithm can start piecing together the grains, maybe not all of it but enough to build a semi-picture and then by extension eventually build the picture, right. We don't know what this singularity will bring forth, right. And how we react to that in the future also I think is beyond our sight. So, we'll need some level of foresight to maybe prepare for that and to prepare for what could come from this. But that's science. That's society. I think that's a problem for society in the future that we'll have to address. I'd be putting in, already now, a level of foresight into what may come out of this 25, 50, 100 years from now.
That's all the time we have today. Catch more of this episode in the next part with your host, Nikesh Gosalia. See you soon.