Editor’s Note: Today’s post is by Daniel S. Katz, Scott C. Edmunds, and Mohammad Hosseini. Dan is co-founder and Associate Editor-in-Chief of the Journal of Open Source Software, IEEE Computer Society Publications Integrity Chair, Chief Scientist at NCSA, and Research Professor in the Siebel School of Computing and Data Science and the School of Information Sciences at the University of Illinois Urbana-Champaign. Scott is Editor-in-Chief at the Open Science publisher GigaScience Press, which publishes software and data papers via its GigaScience and GigaByte journals, and he is also on the Advisory Boards of the Dryad Digital Repository and Make Data Count. Mohammad is an Assistant Professor of Preventive Medicine (Biostatistics and Informatics) specializing in ethics at Northwestern University Feinberg School of Medicine, an associate editor at Accountability in Research, book series editor at Research Ethics Forum, and a member of Springer Nature’s US Research Advisory Council.

Computing techniques and software have supported research since their inception and continue to play a significant role in knowledge production. However, the means of communicating research methods and results were developed long before computing existed, and the research community still lacks best practices for comprehensively documenting the computational elements of research in a transparent, reproducible, and reusable manner.

While some specialist software journals have existed since the 1970s, journals that publish general software papers, such as JOSS and JORS, emerged in the 2010s. These were part of a community effort to recognize software in research, which also led to the emergence of software citation principles, the FAIR4RS initiative, publisher policies encouraging the inclusion or citation of software in papers, standardization of software metadata, and efforts to archive software in general repositories (e.g., Zenodo) and specialist services (e.g., Software Heritage).


Now that publishers more readily accept (and in some cases require) the inclusion of software (typically, source code) associated with submitted manuscripts, whether on their own platforms or on archival platforms, many also see an obligation to support peer review or validation processes that vouch for the integrity of software, just as they do for other content. For example, ensuring that ethical and legal concerns such as authorship, plagiarism, copyright, and licensing are addressed in a fair and consistent manner remains central to their mission. However, unlike debates about the originality of and access to scholarly text, which have been mainstream for decades, the ethical and legal debates about research software are still in their infancy.

For example, unlike plagiarism of text, code plagiarism is not frequently discussed in responsible conduct of research or research integrity training. Given how common it is for software programmers to share and reuse code, it is not even clear what exactly constitutes code plagiarism or how to label or otherwise distinguish new code (Finley, 2017). The way researchers fork and reuse code makes this even trickier, because a detailed history of where the code came from, how it evolved, and who contributed to it can be entirely absent.

Furthermore, licensing for software (and understanding and applying those licenses) is more complicated than for text and data, mostly due to the more complex interactions and usage scenarios involved (e.g., derivatives and patents relate more to code than to data). Indeed, with a much wider range of Open Source Initiative licenses for software than standard licenses for data and text, and with many terms and conditions requiring that various permissions be maintained, license interoperability and the way different parts of code are combined can be complicated and multi-layered.
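As a concrete (and entirely hypothetical) illustration of that layering, the sketch below shows two functions under different licenses combined in one module. It assumes the widely used SPDX-License-Identifier commenting convention for recording per-snippet license information; the function names and license choices are invented for illustration only.

```python
# Hypothetical sketch of license layering within a single module.
# The SPDX-License-Identifier comments follow a real convention (used, for example,
# by the Linux kernel and the REUSE specification), but the functions and license
# choices below are invented purely for illustration.

# SPDX-License-Identifier: MIT
def parse_records(lines):
    """Helper written from scratch by the submitting authors, under permissive MIT terms."""
    return [line.strip().split(",") for line in lines if line.strip()]


# SPDX-License-Identifier: GPL-3.0-or-later
def weighted_mean(values, weights):
    """Adapted from a (hypothetical) GPL-licensed library.

    Because the GPL's conditions extend to derivative works, distributing this
    combined module may require offering the whole work under GPL-compatible
    terms, a constraint with no close analogue for most text or data licenses.
    """
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total
```

When such markers are stripped or never recorded, which is exactly what tends to happen when code is copy-pasted or regenerated by an AI assistant, the obligations attached to the combined work become very hard to reconstruct.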

Integrity of Published Code and Software Can Be Negatively Affected by AI

The rise of AI-powered coding tools takes these existing challenges to a whole new level. Google’s recent announcement that 50% of its coding is now completed with AI assistance is intended to demonstrate how quickly adoption is growing. Novel tools are rapidly transforming software development, promising to enhance the productivity and accessibility of coding.

One may argue that AI is not taking over coding from human developers, but rather augmenting their capabilities. While this may be true in some contexts, as with the increasing use of AI tools to write text, using AI to write code diffuses responsibility for the integrity of that code. Most researchers may have sufficient English knowledge to understand, edit, and take responsibility for AI-generated text, but it is not clear that the average research programmer, who likely understands their discipline quite well, also has sufficient software knowledge to understand exactly what their AI-generated code does. This limitation can create uncertainty about AI-generated code and raise questions about researchers’ responsibility and accountability for it. Furthermore, reviewers and editors may have more difficulty assessing AI-generated code than AI-generated text: while there are numerous (albeit imperfect) tools to detect AI-generated text, there are no such tools for detecting AI-generated code.

AI can also translate code between different programming languages. Currently, code created in this manner from someone else’s original code is almost impossible to detect, increasing ambiguity about who deserves credit and who is responsible for copyright compliance, and making licensing obligations even harder to determine.
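To see why such translations are so hard to detect, consider the hypothetical sketch below. The original routine appears only as a comment (here imagined in R), and an AI-produced Python port of it shares essentially no text with the original, so token- or string-matching plagiarism tools have nothing to latch onto, and the original author and license travel with the code only if the researcher records them deliberately. All names and code here are invented for illustration.

```python
# Hypothetical illustration: an original (license-bearing) R function computing a
# trailing moving average might look like this:
#
#   running_mean <- function(x, n) {
#     stats::filter(x, rep(1 / n, n), sides = 1)
#   }
#
# An AI assistant asked to "port this to Python" could plausibly produce the function
# below. It implements the same idea but shares no tokens with the original, so there
# is nothing for a text-matching plagiarism detector to flag, and nothing in the file
# records where the algorithm or its license came from.

def running_mean(x, n):
    """Trailing moving average with window size n (translated sketch).

    Positions where fewer than n values have been seen yield None.
    """
    out = []
    window_sum = 0.0
    for i, value in enumerate(x):
        window_sum += value
        if i >= n:
            window_sum -= x[i - n]
        out.append(window_sum / n if i >= n - 1 else None)
    return out
```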

While there are some tools for detecting code plagiarism (e.g., Copyleaks), none that we know of work for AI-powered coding and code translation. As with narrative content (and following COPE guidelines), AI is a tool rather than a thinking entity, and should not be credited as an author. However, because the training data for these systems can be very opaque, AI coding tools could be plagiarizing the efforts of others without their users’ knowledge, or with no attention to appropriate attribution and licensing. In November 2022, a class-action lawsuit was filed against GitHub, Microsoft, and OpenAI, with the plaintiffs claiming that the GitHub Copilot coding assistant was trained on open source software from programmers’ public projects without regard for licenses, IP rights, or appropriate credit for the original creators. The resulting legal and technical ambiguity makes it extremely difficult to oversee and regulate AI-generated code under current ethical and legal frameworks. This grey area also raises the question: who should take responsibility for dealing with this situation?

Although COPE and the International Committee of Medical Journal Editors (ICMJE) were relatively quick to respond to the increased use of AI in writing papers, issuing positions on Authorship and AI tools and updating authorship guidelines in 2023, thus far no authoritative organization or initiative has adopted a serious position on the rise of AI use in research software development and the associated ethical issues. While some of the authors of this piece were involved in the FORCE11-COPE working group that drafted the COPE data ethics guidelines, such multi-year efforts may not be suitable here, as practice and possibilities around code and AI are evolving so quickly that the target is constantly shifting.

How Does AI Make Publishing Software and Its Ethics Even More Complicated?

Those who manage journals that publish software (including some of the authors here) are starting to observe undeclared and significant AI use in coding, including code translated between different programming languages by AI (undetectable by standard code plagiarism tools), and authors who do not use compatible licenses for such AI-adapted code. We have tried to deal with these cases by adapting procedures and guidelines (e.g., those developed by COPE) used for dealing with potential plagiarism, attribution issues, and undeclared use of AI in writing text. However, licenses for code and software are very different from those for text.

Likewise, as with generative AI trained on text and images, there is a lack of clarity and transparency regarding the sources used for training and the legality of training these models on those sources. Authors in The Scholarly Kitchen and other venues have extensively covered the challenges publishers face in tracking the licensing and attribution of LLM-trained scholarly content. These issues have made artists and book authors furious about what they see as “theft” of their efforts and intellectual property. In the case of code, AI blurs the line between genuine coding and copy-paste plagiarism because its outputs look legitimate. Furthermore, a researcher can use AI to generate functional software or code without really knowing the concepts behind it or having the ability to detect errors or biases that the AI may have introduced or missed. This raises serious questions about integrity, accountability, and whether researchers actually have the skills needed to explain the code they present themselves as responsible for.

It is only fair to ask: are researchers responsible for the integrity of code generated by or with AI? What should editors do when they discover that submitted code may have been adapted or translated from others’ code or libraries, possibly unethically or illegally? And how can the software review and publication community deal with these new challenges?

COPE has stated that “Authors who use AI tools in the writing of a manuscript, production of images or graphical elements of the paper, or in the collection and analysis of data must be transparent in disclosing in the Materials and Methods (or similar section) of the paper how the AI tool was used and which tool was used,” but a glaring omission in this position statement pertains to coding and research software.

Responding to the Challenges

The scholarly community and especially editors need guidelines about how to deal with transgressions and unethical practice.

Of course, catching and trying to correct bad practice often comes too late, and it is perhaps more straightforward to tackle these issues upstream through education and literacy efforts. As with the increasing use of AI in writing scholarly text, researchers need better resources and more training in how to use AI coding tools in an ethical, transparent, and reliable way. Those developing curricula and teaching software development and coding therefore have a critical role to play here, and need to be empowered and equipped to promote best practices.

The Research Software Alliance (ReSA) has recently started a PubSoft Forum to bring together publishers and others interested in issues related to how research software is seen in scholarly publishing, including how AI is used. Readers who are interested in participating in discussions around the topic of this article are invited to join the forum by contacting [email protected]. The authors of this post are also happy to engage in future collaborations on this topic.

Daniel S. Katz

Daniel S. Katz is Chief Scientist at the National Center for Supercomputing Applications (NCSA) and Research Professor in the Siebel School of Computing and Data Science and the School of Information Sciences at the University of Illinois Urbana-Champaign. He is working to make research software more sustainable by encouraging scholarly incentives and academic career paths for those who work on it, policy changes for its recognition and support, and best practices to make it easier to develop and maintain. He is co-founder and Associate Editor-in-Chief of the Journal of Open Source Software, co-founder and Steering Committee Chair for the Research Software Alliance (ReSA), and co-founder of the US Research Software Engineer Association (US-RSE).

Mohammad Hosseini

Mohammad Hosseini is an Assistant Professor of Preventive Medicine (Biostatistics and Informatics) specializing in ethics and integrity in research, AI ethics, and open science at Northwestern University Feinberg School of Medicine. He is an editor at Accountability in Research, book series editor at Research Ethics Forum, and a member of Springer Nature’s US Research Advisory Council. As an ethics educator, he promotes the responsible integration of AI into research practices, aiming to help researchers anticipate risks, uphold scientific integrity, and engage thoughtfully with societal impacts.

Scott C. Edmunds

Scott C. Edmunds is Editor-in-Chief at the Hong Kong-based Open Science publisher GigaScience Press, which publishes reproducible and open software and data papers via its GigaScience and GigaByte journals. He is also on the Advisory Boards of the Dryad Digital Repository and Make Data Count. With a research background in cancer genetics and 20 years’ experience in Open Access publishing, he has more recently taught data management and curation at Hong Kong University. With an active interest in Citizen Science, he has co-founded the Citizen Science organisations Bauhinia Genome and CitizenScience.Asia.

Discussion

3 Thoughts on "Guest Post – Code Plagiarism and AI Create New Challenges for Publishing Integrity"

This seems like a category error to me. While I can understand the policies on using an LLM to write the text of a paper, I find it hard to be upset about researchers using LLMs to write code, and not just because everyone is already doing it. The text of a paper means something, code does something. Sure, you can have clever algorithms for something, but you wouldn’t consider it a novel contribution if someone implemented your clever algorithm in another language, would you? What you want to be credited for is the idea, so whether you used an LLM or not doesn’t seem like it should matter.

Hi William, We didn’t write this piece because we were upset about researchers using LLMs to write code; it was more to get the publishing community to think about the practical consequences of this technological advance. As with LLM-generated text, the genie is out of the bottle, and we just need to make sure people use these tools responsibly and ethically. There is a big black hole in policies and legal guidelines around all this, so if things do go wrong (plagiarism, lack of attribution, licensing issues, etc.), how are we supposed to deal with them? Part of my personal motivation for writing this was that my journal had to deal with a submission that had many of these issues flagged, and we discovered there were no COPE or other guidelines to work from. And the more we looked into it, the more interesting and complicated issues it raised to work through. For example, looking at the GitHub Copilot court case that we cite here, the LLMs could be plagiarising other people’s code without the users even knowing, so if journal editors accused authors of code plagiarism, could they argue they didn’t know and it was the AI’s fault? I’m glad this topic is provoking discussion anyway, and thanks for your feedback.

It’s a shame there’s not better guidance for editors here, and I think we will eventually get guidance along the lines of the guidance for methods sections. In other words, the words used to write the methods or the code aren’t the novel scholarly contribution, but rather the underlying idea that was implemented to solve a problem in a novel way. Does that seem likely to you as well?

I think you’re right that we’re going to see a lot of novel algorithms used by LLMs when prompted by people who are not the original author, and who may not even know who the original author of the code is, or even that an author exists. It will be hard to hold them accountable in this case! I personally wish LLMs were better at citing their sources, but for now the actionable issue would seem to be an author claiming to have created a novel algorithm who can’t substantiate that claim with the version history of a code repository or other evidence demonstrating work on the problem over time. As LLMs get more capable and agentic, they’re going to solve more problems in one shot, authors won’t have this trail of work to show, and I would be very hesitant to impose requirements here.

It may be helpful to step back a bit to look at the broader changes LLMs and multi-modal models are bringing to science. The overall goal of scholarly publishing is to accelerate research by facilitating the exchange of information, right? Credit needs to be allocated where it’s due for funding and tenure purposes, of course, but assuming a world where LLMs can create novel solutions to most software problems also assumes a world where less and less software is written by authors. I know we spent decades working to make software and data legitimate scholarly objects, as opposed to just papers, but now is not the time to hold on to this in the face of much broader changes to how science is done.

The fact of the matter is that LLMs were not built by and for scholars, and they generally don’t preserve attribution, and researchers should use them where they accelerate their research program. The question for publishers is how to not get in the way of this while still upholding scholarly standards.

I’m not in a decision-making position on any of this, so no one needs to read or care what I think, but I still care about making things better for researchers and it’s fun to think about. Appreciate your post!
