Insights Xchange: Conversations Shaping Academic Research

Fake Photos, AI Tools, and Research Fraud with Elisabeth Bik

May 15, 2023 ScienceTalks Season 2 Episode 16

Research fraud is increasingly common these days. Will AI tools just make this worse or can they help combat fraud? Nikesh Gosalia talks to Elisabeth Bik about research fraud, starting with the plagiarism incident that inspired Elisabeth to become a full-time research integrity volunteer. She talks about the main factor pushing researchers to commit scientific fraud: the pressure to publish. Despite the high volume of papers being published these days, Elisabeth feels that quantity doesn’t always reflect quality. She discusses the development of AI tools and their effect on scientific fraud, such as the use of AI tools to create fake photos. Conversely, there are tools to detect fraudulent elements, but their limitations mean that human input is still needed. She also shares her techniques for detecting research misconduct, from examining duplicated photos to questionable animal ethics. Finally, Nikesh and Elisabeth touch on issues surrounding open science and research misconduct. Elisabeth thinks the adoption of open science will help reduce fraud in academic publishing.

Dr Elisabeth Bik is a science integrity consultant and microbiologist. With a PhD in microbiology, she has worked in academic publishing as a science editor and director of science. Featured across multiple mainstream media outlets, Elisabeth received the 2021 John Maddox Prize for her work on exposing threats to research integrity. She can be reached on Twitter.

Insights Xchange is a fortnightly podcast brought to you by Cactus Communications (CACTUS).

I'm very worried about AI developments, tools like ChatGPT, but also even more importantly, tools that can create images that look very real. It is very worrisome because we no longer can distinguish real from fake. And also impact factors. And you can argue if those are all the right reasons to focus on, but I don't think publishing in Nature or publishing 20 papers a year makes you necessarily a better scientist than a person who publishes one really good paper in a lower impact journal per year. Is there ever a day with zero fraud? More science doesn't mean better science. It's just we need to slow down science because it's just going too fast. Or maybe I'm just sounding like an older person now. The type 3 duplications are images that contain elements within the photo that have been duplicated.

All Things SciComm. What does the future of science look like? What's happening in science communication? Here's your host, Nikesh Gosalia.

Today, on All Things SciComm, with millions of research articles published each year, who looks into promoting integrity within the scientific community? That's where Elisabeth Bik comes in. As a science integrity volunteer since 2019, she's been dedicated to uncovering instances of fraud and manipulation in scientific academia to ensure that the publishing ecosphere remains honest and trustworthy. Her passion for scientific integrity has seen her report over 6000 manuscripts for image manipulation or other concerns. She's been featured across multiple mainstream media outlets, and she received the John Maddox Prize in 2021 for her work on exposing threats to research integrity. Elisabeth, it's an honor to have you with us today.

I'm so happy to be here.

For any listeners who are unfamiliar with your work, which I highly doubt, but…

They do exist, I'm sure.

Okay, so for those who exist, can you tell us a little bit about yourself and what do you do in the scientific community?

Yeah, of course. I'm Elisabeth Bik.
I was born and raised, and I did my Ph.D. in the Netherlands, so I'm Dutch. But I've lived now for 21 years in the US. I'm a microbiologist by training, and I've worked 15 years at Stanford as a postdoc and staff scientist on the microbiome, the bacteria that live inside our bodies and that of dolphins also. Around 2013, I got interested in plagiarism. I discovered a couple of papers, one paper had stolen one of my sentences, and I found more and more papers that had stolen other scientists' sentences. It just made me mad. I worked on that for a couple of years during my full-time job at Stanford. I also worked two years in the industry, and I kept on doing this as a hobby, but I switched from plagiarism to looking at scientific papers and specifically images. I have some talent of discovering duplicated images, either photoshopped or just plain, like an error plain duplicated. I've been doing this for a while. A couple of years ago, I quit my job and I do this now full time, as a volunteer. I try to spread the word about signs of misconduct by using Twitter and giving talks at universities and for scientific publishers.

Very impressive. Just to kind of talk a little bit about manipulation and fraud, Elisabeth, was that the first time that you came across when you experienced it yourself? Or was it that you had seen it before, maybe you thought this is a one off and then kind of just went into this a little later?

I hadn't seen it before. I had heard about science misconduct vaguely when people talked about it, similar to the way people are interested in reading about murders, or, I don't know, robberies of paintings or so. It sounded like far away and intriguing, interesting, and maddening also, because it goes against everything that science should be. But I hadn't seen any misconduct. I think I have been very fortunate to have worked in good labs with honest people, and I assumed that every scientist was honest and would never resort to cheating.
But yeah, it turns out that there are some scientists who do that.

Before we go into maybe discussing this a little bit more, Elisabeth, I mean, because you've been a researcher yourself, why do people do this in the first place? Is it just a lot of pressure on researchers? Are there more reasons? I mean, I'm just very curious to know why in the first place.

I do think there has been an increased amount of pressure on scientists to publish. If you think about older generation of scientists that are the Einsteins and the Newtons, I don't think they were pressured to publish so many papers a year. They could do what they wanted, but I think science has become an industry where we are held accountable as scientists to the amount of papers we publish and also impact factors, and you can argue if those are all the right reasons to focus on, but I don't think publishing in Nature or publishing 20 papers a year makes you necessarily a better scientist than a person who publishes one really good paper in a lower impact journal per year. I don't think this measuring a scientist's output by looking at numbers of papers and citation factors and impact factors are necessarily good measures. But this is the reality. We are held accountable for how many papers we publish. We already might be expected to publish a paper during our Ph.D. I think this is one of the reasons that drives people to cheat. There are certain countries where these requirements are much stronger than in the US or in Europe. In those countries, we tend to see more suspected cases of science misconduct, purely because people are just required to do impossible things, to publish X number of papers, even though they're not given the time to do research, for example.

With some of the staggering numbers that we just spoke about in the introduction, is there ever a day with zero fraud?
Or is it fair to say that fraud is probably an inevitable part of scientific research considering the quantum of papers that are being published?

I think papers with fraud are published every day, just because the amount of papers that are published every day, all papers, is staggering. That amount is – I wouldn't even know. But I think there's millions of scientific papers published per week probably. I have no idea. I'm just making a guess. But it seems that the amount of scientific papers being published has gone up in the past 10 years or so, and it's just going up every day. I think scientific papers become – there are so many of them. As a scientist, you cannot really keep track of what happens in your field anymore. It's just too much. Maybe we need to slow down science a little bit. I do think the amount of papers, the request for peer reviewers has gone up. Peer reviewers tell me, people who regularly do peer review, they say, "I get like dozens of requests a week, I can only do one or two." But it seems that the demands on peer reviewers, on editors, on journals have all gone up, the pressure has gone up, and it's just going to not necessarily lead to better science, like more science doesn't mean better science. It's just we need to slow down science because it's just going too fast. Or maybe I'm just sounding like an older person now, that things are going too fast for me, but I feel that sentiment also with younger people.

I hear that quite often as well, Elisabeth. Do you think – I mean, just on the topic of quantum of papers or just the amount of science that's coming out, with the push towards open access and even now, ChatGPT, do you think that will just accelerate even more, and there'll be more need to have stronger systems in place, robust processes?

Well, open access, in general, I think, is a very good development.
We need to make science accessible for everybody, no matter if you are a scientist or not, maybe just a patient, but we're all taxpayers, we all have contributed to science. I am a firm supporter of open science, but that doesn't necessarily contribute or prevent fraud. I do think that developments like open science, which is slightly different, this is where you make everything available, not just the paper available for free, but also the data behind it. I think that could prevent fraud. But there are also journals who make misuse of the open access model. There are predatory journals who are not good journals. They just are there for the profit. They will ask the scientists to give them money, and then they'll publish the paper for free, but there's no good quality control at all. We have seen an enormous rise in open access predatory journals who make misuse of the otherwise great open access model. If that is what you're asking, I think that open access is good. But we need open science, which is making all the data available to prevent fraud, that will make a contribution. Otherwise, you can just make up data, right? It's very easy to do that. I'm very worried about AI developments, tools like ChatGPT, but also even more importantly, tools that can create images that look very real. I don't know, I mean, the development of that is staggering. We've seen very believable photos of, let's say, Trump being arrested. I was just looking yesterday and those photos, you can still see it's fake, and you can laugh about it. But it is very worrisome because we no longer can distinguish real from fake. We tend, as humans, to rely on our eyes. You know, seeing is believing. And so, if we see a photo, we tend to think now it's real, like you don't just hear a story. If you see a photo, you believe it more so than just hearing a story. 
If you see a photo that looks realistic, and this might just be a photo of cells or tissues, you could actually make a convincing story that your experiments were real. And so, these developments are going to hinder science because as a peer reviewer or a reader, you cannot really tell that these papers are fake. I'm very worried about otherwise exciting technical developments, but in the hands of the wrong people, they can do a lot of harm.

Is there any official data, Elisabeth, in terms of the numbers around fraud in scientific academia? Based on your experience, how much of maybe total research published in these journals could be considered fraudulent?

Yeah, I did some research a couple of years ago on that to answer that exact question. I wanted to know specifically, if you look at papers that have photos, so papers with photos of mice, or plants, or gels, or blots, or microscopy photos, how many times would you find images that have been duplicated inappropriately? Two photos that represent two different experiments. One of them is right, and one of them is wrong, presumably, or even photos that have been tampered with. Photos with duplicated elements, like the same cell visible twice in the same photo. We looked at that. I looked at a set of 20,000 papers, I scanned them, and I found 4% of these papers to contain images that have been duplicated, and we estimated about half of them, so 2% of that set, had been deliberately tampered with. Either photoshopping or just moving your sample under the microscope a little bit so you see a slightly different view, but you have an overlap or so. So, 2% of all papers had science with photography, images that might have been done intentionally. But I also think that the real percentage of fraud has to be higher, because I can only look at photos, and there's so many ways you can fabricate your data, you could make a line graph that looks convincing or a table.
It's much harder to detect fraud in those types of data than in figures. I can see figures, but I'm catching only the tip of the iceberg. The real percentage of fraud has to be higher than 2%. We think it's between 5% and 10%. I've heard people even say it's higher, but I don't know, I still believe that most scientists are honest. But I think 5% to 10% is my best bet. I might be completely wrong, but it's based on some data.

You mentioned, Elisabeth, that you've been looking at this area for quite some time. You're able to identify instances of fraud much more easily than probably somebody else. But just, again, for the benefit of our listeners, how do you tell when there is an instance of fraud? Are there any very obvious telltale signs of falsified data or images? I mean, where would you kind of look?

I would look at the images mainly. That's my specialty. There are three types of image problems that I can detect. One would be using the same photo twice to represent two different experiments. Like I said, a photo of mice or a photo of cells, or tissues, or blots, or DNA gels or things like that. The second type is where it's not just the exact same photo, there's a shift. The photo is either rotated, or mirrored, or just shifted a little bit. For example, if you take two photos of some tissues under a microscope, and you move your sample a little bit under the microscope and take another photo, but you move it enough so the photo is slightly different but there's still an overlap for me to find, that will be, what I call, a type 2 duplication. While the type 3 duplications are what I said before, images that contain elements within the photo that have been duplicated, so the same cell visible twice or the same mice visible in the same photo, suggesting that the photo has been digitally altered, photoshopped. That's the type 3 and that is very likely to have been done intentionally. Those are the things I focus on.
But there's many other problems I have found just by looking at papers with a critical eye. Examples could be conflict of interest. As an extreme example, let's say there's a paper that claims that cigarettes are not harmful for your health, but it's sponsored by a cigarette manufacturer. You would not really trust that paper because there's a clear conflict of interest in the message of the paper and the sponsor. That will be something I could raise or not disclosing that. Animal ethics, for example. There's no approval by the university to do animal experiments or photos of tumors on mice that are very big, suggesting that animals have really suffered. I mean, those are things that are not technically misconduct, perhaps, but it's definitely wrong. Like, this is something I could raise, or a methodology where people don't have a control group, or the control group is very different already than the treatment group at the start of the experiment. Those are all things you can raise. There are many things that can be wrong with a paper. I'm focusing on images, but I found many of these examples above myself.

I know I was at the STM event in London, Elisabeth, before the end of the year, and there was a startup day. I think there were a few startups who were talking about using technology to detect some of these fraudulent elements. But in your experience, are there any sort of tools or applications you've come across which successfully help you detect these fraudulent elements?

Yes, I've been working with several tools, detecting image duplication. And so, a couple of years ago, I collaborated with DARPA, the Defense Agency in the US, working on image duplication detection, image forensics, and it turns out that what our eyes can see pretty easily is pretty hard to detect for a computer tool. It took a while before these tools have been developed. But now they are available since one or two years. There are several in the market.
Of course, these are focusing on image duplications. They cannot detect, to the best of my knowledge, image manipulation, but they can detect duplicated elements. These tools, obviously, are very interesting for scientific publishers who want to test their incoming manuscripts for all kinds of problems, including plagiarism, they have a tool for that. They now can also test these manuscripts for image problems. Because it's a computer tool, it can also detect images. Some of these can detect images that have been republished, that had been published before in other papers. Across papers, they can detect image duplication, which is for human very hard. I cannot remember thousands of images from hundreds of papers, but the software can detect that. I'm using those. They're pretty good. They still completely miss some obvious things that I just see with my eyes, and you still need a human to click on that and see if all the things that it flagged are realistic, because sometimes duplications can actually be appropriate. Those will be flagged by the software and a human still needs to tell, okay, this is a real problem versus this is quite okay, I'm going to disregard this, this flagged image.

Going back to the whole topic about AI tools like ChatGPT, Elisabeth, like you correctly mentioned, there are two schools of thought there. There are certain things where it can help us to improve efficiency, maybe automate some of the very rule-based processes, but at the same time, there is potentially an issue where you could use these tools to further create fraudulent activities for lack of a better term. I think as soon as ChatGPT came out, in a couple of weeks, I read Forbes had done a survey and 89% of students were already using ChatGPT in some form or the other. I'm just wondering, with AI being used to create maybe false information in a few cases and also being used to detect artificial information, I mean, is there a balance we can achieve?
Or is there one kind of winning over the other? How do we really tackle this?

That's such a complex question. I do think we – it's not a technology we should prohibit in all circumstances, because it can be wonderful. For example, to help a person for whom English is not their first language to write an article in English, like write maybe something themselves first and then have the ChatGPT or some other tools make it better, make just the English better. I think that will incredibly help people who have always been at a disadvantage to writing scientific papers. Even though I've lived 20 years in the US, for me, English is not my first language, so I had to learn it. But at least now I feel privileged, I can speak it, I can write it, but that's not the same for everyone. It could tremendously help people for whom writing in English – as a scientist, you have to write in English. That's the accepted language of our publications. And so, it will help those folks. I think there are tremendous advantages to using such tools. I might use it myself one day. I haven't really played with it. But I could just write an introduction or so by using a tool like that. But so far, I've been very disappointed with what the tool produces. Because it's not really scientific; it's too general. It doesn't really help me write a scientific paper, but it will one day, I'm sure. It could, of course, sample a lot more data than any human could read. There's lots of advantages. But like I said before, in the hands of the wrong people, it can generate misinformation. It already has. There are some funny examples on Twitter and in articles about completely false information that it just presents as if it was real. People have argued, well, if students all use that, they're not learning to write anymore. I think there's sometimes too much emphasis, at least I find in the US on writing, I feel that all answers need to be in the form of an essay. That's not how I was taught in high school.
We just had to write a very short answer, like this is the formula, this is the correct translation or so, but I have never written many essays myself, and I don't see that it's always useful, although people will probably disagree with me on that. I feel we need to put more emphasis on critical thinking. I feel that is a skill that has been lost by a lot of students. Let them do more critical thinking and don't focus so much on essay writing, because I feel that is something that now a computer apparently can do better than a human or will be able to do so next year. I'm not sure if that's a skill we should emphasize. It's similar to, can we have students use a calculator? I think that was a discussion a lot of years ago, and there are settings in which a student cannot use it. But there are settings now, in our professional worlds, we all use calculators, and we don't feel that's a lost art. I do feel we need to embrace new technology. But again, there are with this one making up false data is just something that worries me.

I know you've mentioned this when we were talking about open access, Elisabeth, that only open access is not going to be the solution. Open access in general, of course, is great. But I think along with open access, we have to look at open science. As far as your experience is concerned, Elisabeth, I mean, where do you think we are in that journey? You think there is more support that we are receiving from the industry as a whole in terms of moving towards open science, or you think we are still fairly slow, and it's been very frustrating to see lack of progress?

I'm seeing some progress, but it is going very slow, indeed. The Netherlands is actually moving towards like more open science models. I feel they're one of the frontrunners on that field. In the US, not so much. I do see that more and more journals are starting to say, oh, you need to upload the original blots or the original gels or original photos.
But I'm seeing very few authors actually following up on that. It seems that it's more a suggestion and not everybody's doing that yet. But this is different than a couple of years ago. It's starting to become a requirement for more and more journals, although I have not really seen too much progress. But I hope this is a trend that will continue, because if I see a paper with – and I have some doubts about, let's say – if that data really existed, but the paper is a couple of years old, and you say to the authors, hey, I have some concern about this figure, it looks like one cell is visible twice, I would not have expected that, or these error bars look very small or these data have been duplicated. Can you show me your original data? Very often, the authors will– if they reply at all, they will reply, well, the data is lost, the paper is older than five years, we moved our laboratory and the books got lost or the hard drive melted or there was a tsunami or there was a tornado or an earthquake. A lot of disasters seem to happen in these labs. You cannot really know if it was real. But I hear that so often that you have to sort of assume this was some – the dog ate my homework type of excuse. If the data are shared upfront, while the manuscript is submitted, the data is still fresh, right? It's still warm. It just came out. There's never that excuse that the author can use that the data was lost, because they had already submitted it upfront. I hope this is a development that more and more journals will continue, will require. And yeah, it is a lot of work for the authors. But you already have the data at that moment. When you submit your paper, the data is still fresh. Five years from now, you don't remember exactly what you did, you cannot remember where the file was, and it will just be much harder. 
I do hope with the open science movement where these requirements are getting more strict from journals, that it will make it harder to fraud or to find errors, find original data, we need to still be able to have all that data.

Please ignore if this question sounds a little bit silly, Elisabeth, but just a curiosity that popped in my mind. What would stop publishers – I don't know, the whole universe of publishers, to move towards this, like say tomorrow that you have to publish all your data, we have to move towards open science, is it more the systems as an issue? Or is it more that authors probably find it very tedious? Or is it a combination of a few things?

I mean, it needs more storage, right? Like all that data needs to be stored somewhere. But I think computer storage, data storage is not really an issue anymore, right? Like that is solvable. There are servers, and we can dump the data somewhere. We no longer have to say you can only send 100 KB file, and we cannot handle anything bigger. That has already been solved. Yeah, authors will protest, they will say it's a lot of work. There are types of data that are hard to share. Like if you have a sequence run, like that might be a couple of gigabytes, like are you really sharing the very original data? Or can it already be a little bit cleaned up? Or if it's a photo of under the microscope, how can you prove it's the original photo? I think there are some issues that you cannot really trust if a person sent you the data that was real. I think there's, from my perspective, no real problem there. This is all technically doable. Yeah, it requires a little bit more effort, but you already have the data. I don't see a problem. You have to think about patient data. You cannot really share obviously names of patients or their informed consent forms. But there might be some arguments about that, like I don't want to share my patients. Sure, yeah, I think for a lot of data, you can just upload them.
That's all the time we have today. Catch more of this episode in the next part with your host, Nikesh Gosalia. See you soon.