Facilitating automatic plagiarism detection in programming courses

Detecting Plagiarism in Programming Assignments through Visualization

August 12th, 2023

Outline

This project was my 40 ECTS thesis, done at the Maersk Mc-Kinney Møller Institute. Below is a highly condensed version of my thesis; the entire thesis can be retrieved here.

The demand for programming skills is ever rising, yet mastering software development requires time and expertise, leading to a steep learning curve. Academic stress can drive students to cheat in programming assignments, blurring the line between plagiarism and collaboration. Plagiarized code can be tweaked to avoid detection, challenging professors. Universities struggle to detect and address plagiarism efficiently due to resource constraints, fostering a cycle of academic dishonesty. Addressing this, a platform is proposed to assist professors in detecting plagiarism in coding assignments. The scope focuses on aiding professors in assessing plagiarism through visualization and exploring current perceptions of plagiarism among educators. This qualitative project aims to build a user-friendly platform, leveraging existing plagiarism tools and visualizing code overlaps for ease of detection. The research questions investigate professors' approaches towards plagiarism and the effectiveness of visualization in detecting code overlap. The project involves surveys and observational case studies, focusing on real-world usability and professor feedback to enhance academic integrity.

Understanding current practices

Professors' strategies for preventing and identifying code plagiarism are investigated through a survey targeting educators who teach courses containing software assignments. Despite a limited sample size, valuable insights are extracted. Demographics span gender and teaching experience, ensuring diverse perspectives. Reasons behind student plagiarism vary, with pressure to pass and ignorance of citation rules among the noted factors. Plagiarism detection responsibilities tend to remain consistent throughout the semester for most professors. Manual review of submissions is the most common approach, possibly due to inadequate or unavailable automated tools. Emphasizing consequences is deemed important to deter plagiarism. Plagiarism detection software is seldom used for optional assignments and exams. Confidence in automated software's effectiveness is mixed, and concerns about false positives/negatives persist. Professors express varying confidence in recognizing plagiarism instances, suggesting potential benefits of automated tools in uncovering otherwise overlooked cases.

Visualizing similarity

The table below summarizes the visualization strategies that were explored, along with their merits and drawbacks. Among them, the HeatMap, Box plot, and Graph stand out due to their suitability for the project's objectives, despite inherent trade-offs.

| Method | Pros | Cons |
| --- | --- | --- |
| Scatter plot | Suitable for displaying clusterings | Can become cluttered with large datasets |
| HeatMap | Suitable for displaying relationships and aggregating data | Color scales need to be implemented correctly; less ideal for smaller datasets |
| Histogram | Suitable for displaying distributions of overlap | Less ideal for smaller datasets |
| CodeCity | Developed to visualize code | Not ideal for showcasing similarities |
| Word cloud | Intuitive to understand; shows relativity between tokens | Dependent on understanding underlying tokens; less precise |
| Box plot | Good at displaying outliers and summarizing data | Less reliable with smaller datasets |
| Dot plot | Simple to understand | Dependent on understanding underlying tokens to be useful |
| Graph | Depicts relationships; suitable for displaying clusterings | Requires filtering to limit the number of edges that clutter the graph |

Based on this evaluation, the HeatMap, Box plot, and Graph were chosen for implementation; each is briefly explored below.

HeatMap

This choice stems from its capability to connect students with multiple files, with block color intensity representing the degree of overlap. However, this assumes a single overlap per file, which is rare in practice; each block therefore represents the highest similarity found for that file, covering the most significant cases. The approach relies on repeated file names, limiting its use for tasks where students create their own file structure. While it may not apply to all scenarios, the HeatMap excels at drawing attention to instances of high overlap, facilitating quick identification of problematic students, as seen below.
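ApexCharts (part of the stack listed at the end) supports heatmap charts, and a minimal, illustrative sketch of how such a student-versus-file heatmap could be wired up is shown below. The student names, file names, and percentages are made up, and this is not necessarily how the platform itself builds the chart.

```typescript
// Minimal sketch: a student-vs-file similarity heatmap rendered with ApexCharts.
// All names and values below are invented for illustration.
import ApexCharts from "apexcharts";

const chart = new ApexCharts(document.querySelector("#heatmap") as HTMLElement, {
  chart: { type: "heatmap", height: 350 },
  dataLabels: { enabled: false },
  // One series per student; each point pairs a file name (x) with the highest
  // similarity found for that file (y), as described above.
  series: [
    { name: "Student A", data: [{ x: "main.py", y: 82 }, { x: "utils.py", y: 15 }] },
    { name: "Student B", data: [{ x: "main.py", y: 78 }, { x: "utils.py", y: 12 }] },
  ],
});

chart.render();
```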

Box plot

The Box Plot excels at swiftly spotting replication concerns. Based on top similarities among repeated files, quantiles help identify high-overlap files. However, it assumes repeated files, making it less effective in projects with varying student-generated files. Overlaying a student's overlap with the population Box Plot can highlight deviations, assisting in detecting unusual submissions.
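To make the quantile idea concrete, here is a small, self-contained sketch (not taken from the platform) that computes quartiles over per-file top similarities and flags values beyond the usual 1.5×IQR whisker as outliers worth a manual look:

```typescript
// Linear-interpolation quantile over a sorted array of similarity scores.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

// Flag similarities above Q3 + 1.5 * IQR as potential replication concerns.
function outliers(similarities: number[]): number[] {
  const sorted = [...similarities].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return sorted.filter((value) => value > q3 + 1.5 * iqr);
}

console.log(outliers([12, 15, 18, 20, 22, 25, 90])); // [90]
```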

Graph

This visualization focuses on relationships and potential collaborations among students. Individual students are represented as vertices, with edges signifying comparisons between them. While valuable, the graph visualization assumes repeated files and might be of limited use in scenarios with low overlap percentages. Clusters can also be hard to detect if related vertices are not laid out near each other. By highlighting the top overlaps by default and indicating mutual comparisons, the graph emphasizes relationships.
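Cytoscape.js (also part of the stack) is a natural fit for this kind of graph. The sketch below is illustrative only: students become vertices, comparisons above a chosen threshold become weighted edges, and a force-directed layout makes clusters visually apparent. The data and threshold are invented.

```typescript
// Illustrative sketch: students as vertices, filtered pairwise comparisons as edges.
import cytoscape from "cytoscape";

const comparisons = [
  { a: "student-1", b: "student-2", overlap: 0.87 },
  { a: "student-1", b: "student-3", overlap: 0.12 },
  { a: "student-2", b: "student-3", overlap: 0.74 },
];

const threshold = 0.5; // hide low-overlap edges that would clutter the graph

cytoscape({
  container: document.getElementById("graph"),
  elements: [
    // vertices
    ...["student-1", "student-2", "student-3"].map((id) => ({ data: { id } })),
    // edges kept only when the overlap exceeds the threshold
    ...comparisons
      .filter((c) => c.overlap >= threshold)
      .map((c) => ({
        data: { id: `${c.a}-${c.b}`, source: c.a, target: c.b, weight: c.overlap },
      })),
  ],
  layout: { name: "cose" }, // force-directed layout makes clusters stand out
});
```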

Tooling analysis

It is important to note that the goal of this project is not to invent new plagiarism detection algorithms, but rather to analyze and leverage existing tools for effective similarity detection in software development.

Several popular plagiarism detection tools have been examined:

  • MOSS: Developed at Stanford University, MOSS is a widely used closed-source algorithm that employs a web service to which submissions are uploaded. It supports various programming languages and utilizes the winnowing algorithm (published in 2003) for detection.

  • Sherlock: An open-source algorithm from the University of Sydney, Sherlock performs non-lexical token searches to identify overlap between files, without language restrictions.

  • Plaggie: Designed exclusively for Java files, Plaggie is an open-source token-based plagiarism detection tool, although it is limited to Java 1.5 and receives minimal support.

  • JPlag: An actively developed open-source algorithm, JPlag supports multiple languages and uses a token-based approach with greedy string tiling. Notably, it can subtract framework and library code.

  • SIM: A publicly available token-based algorithm from 1999, SIM can be extended for new languages but has limited support for modern languages.

  • CodeMatch: A commercial solution supporting a wide array of languages; the inner workings of CodeMatch are only sparsely documented.

The research indicates that tools like YAP3 and Marble, which were previously popular, are no longer available. The examined tools all compare submitted files pairwise, resulting in a time complexity of O(n^2) in the number of submissions.

While each tool has its advantages and disadvantages, recent advancements in AI and machine learning have demonstrated their potential in detecting overlapping code. Modern AI approaches can better handle code relocation and restructuring. MOSS poses data security issues due to its closed-source nature and centralized submission server. Plaggie is outdated and supports only legacy Java versions. Hence, MOSS and Plaggie are excluded from this project's scope. Instead, the focus will be on JPlag, Sherlock, and SIM due to their availability and features.

Before proposing any implementation, it's crucial to understand the data input and output mechanisms of different algorithms, as these dictate the code modeling process. Given the divergent nature of output from each algorithm, understanding commonalities and differences is paramount. This understanding guides the design of interfaces for the algorithms. While commonalities streamline the integration process, addressing discrepancies ensures flexibility.

Three critical expectations shape the data structure design:

  • Overall Overlap: This describes the total overlap between two submissions, represented as the number of overlapping tokens, a percentage, or line count.
  • File-based Overlap: Similar to overall overlap but on a per-file basis, revealing the extent of similarity between individual files.
  • Line Descriptors: These provide insights into specific lines or tokens that overlap, typically presented as "line start" and "line end" descriptors. Although tokens are usable, their interpretation varies across algorithms.

Additionally, automatic programming language detection capability and other supplementary data generated by each algorithm are noted.
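To make these expectations concrete, a common result model could look roughly like the sketch below. The field names are my own and only illustrate the three expectations; the platform may model this differently.

```typescript
// Sketch of a common result model covering overall overlap, file-based overlap,
// and line descriptors. Names and units are illustrative.
interface LineMatch {
  lineStart: number;
  lineEnd: number;
}

interface FileOverlap {
  fileA: string;
  fileB: string;
  similarity: number;   // percentage, token count, or line count depending on the algorithm
  matches: LineMatch[]; // empty when the algorithm provides no line descriptors
}

interface ComparisonResult {
  submissionA: string;
  submissionB: string;
  overallSimilarity: number;
  files: FileOverlap[];
}
```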

The following paragraphs elucidate the output structure of each algorithm in relation to the aforementioned points.

JPlag

JPlag, as the most extensive algorithm, supports multiple programming languages and offers two interaction approaches: its Java library or its command-line interface (CLI). The CLI operates on a specified input directory of student submissions. Running the CLI with an installed JDK generates JSON files detailing the results. For instance, "overview.json" contains statistical data and suspected clusterings of students, while individual comparison files offer per-pair overlap information. The output is structured and lends itself to efficient post-processing.
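As an illustration of this post-processing, the sketch below reads JPlag's per-comparison JSON files and sorts the pairs by similarity. The field names are illustrative assumptions, not the exact JPlag schema, which depends on the JPlag version used.

```typescript
// Illustrative only: field names below are assumptions, not JPlag's actual schema.
import { readFileSync, readdirSync } from "fs";
import { join } from "path";

interface JplagComparison {
  first_submission: string;
  second_submission: string;
  similarity: number;
}

function loadComparisons(resultDir: string): JplagComparison[] {
  return readdirSync(resultDir)
    .filter((file) => file.endsWith(".json") && file !== "overview.json")
    .map((file) => JSON.parse(readFileSync(join(resultDir, file), "utf8")) as JplagComparison);
}

// Sort so that the most similar pairs are reviewed first.
const comparisons = loadComparisons("./jplag-result").sort((a, b) => b.similarity - a.similarity);
```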

Sherlock

Sherlock, operated through its command-line interface, requires specifying the file extensions to analyze and allows setting a threshold. The output is in CSV format, listing matched files and their overlap scores. Sherlock's output lacks grouping and requires significant post-processing to derive meaningful insights.
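A sketch of such post-processing is shown below, assuming a simple "fileA,fileB,score" row layout purely for illustration; the actual column layout should be checked against the Sherlock version in use.

```typescript
// Illustrative CSV post-processing; the assumed column order is fileA,fileB,score.
import { readFileSync } from "fs";

interface SherlockMatch {
  fileA: string;
  fileB: string;
  score: number;
}

function parseSherlockCsv(path: string): SherlockMatch[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => {
      const [fileA, fileB, score] = line.split(",");
      return { fileA, fileB, score: Number(score) };
    });
}

// Further grouping per submission would be needed to derive per-student overlap.
```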

SIM

SIM provides separate executables for its supported languages, so choosing the correct executable is essential. The output lists the files analyzed and their corresponding tokens. Overlaps between files, including common lines and tokens, are exhaustively listed. However, SIM's output lacks standard formatting and necessitates careful extraction using techniques such as regular expressions.
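The sketch below illustrates this kind of extraction with a regular expression over an invented line format; the real SIM output differs and must be inspected for the version in use.

```typescript
// Illustrative extraction from free-form output; the line format here is invented.
const sampleLine =
  "submissionA/main.c: line 10-42 [33 tokens] matches submissionB/main.c: line 8-40";

const pattern =
  /^(?<fileA>\S+): line (?<startA>\d+)-(?<endA>\d+) \[(?<tokens>\d+) tokens\] matches (?<fileB>\S+): line (?<startB>\d+)-(?<endB>\d+)$/;

const match = sampleLine.match(pattern);
if (match?.groups) {
  const { fileA, fileB, tokens } = match.groups;
  console.log(`${fileA} overlaps ${fileB} by ${tokens} tokens`);
}
```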

Comparison

Each algorithm presents unique strengths and weaknesses, which are summarized below.

| | JPlag | Sherlock | SIM |
| --- | --- | --- | --- |
| Language detection | Automatic | N/A | Manual |
| Overall overlap | Yes | No | Yes |
| File-based overlap | Yes | Yes | Yes |
| Line descriptors | Yes | No | Yes |
| Output type | JSON | CSV | Custom |
| Post-processing | 1 | 2 | 5 |

The post-processing score (1 = little post-processing required, 5 = extensive post-processing required) is subjective and reflects experimentation during integration.

Challenges

The project's findings provide valuable insights into how academia approaches plagiarism in coding and the understanding professors have of the reasons behind student plagiarism.

Plagiarism detection in code

Professors commonly find detecting plagiarism in coding assignments challenging, often resorting to manual inspection. To deter plagiarism, professors emphasize student awareness of the consequences, but if cheating goes undetected, this approach alone may end up reinforcing it. Implementing automatic plagiarism detection, despite its perceived inefficacy, has a preventive impact, as supported by research.

Assignment complexity

Not all coding assignments are equally suitable for plagiarism detection due to varying complexity. While plagiarism detection algorithms can identify overlapping code, the differing sizes of codebases and tokenization methods may distort the perceived level of overlap. The effectiveness of removing boilerplate code is limited when assignments lack a common structure or heavily deviate from the original base code. This observation underscores that the presented plagiarism detection is most effective for simpler tasks with clear instructions.

Tokenization increases similarity

Tokenization is essential to analyze and compare code. Ironically, the tokenization process designed to detect code similarity can inadvertently increase the likelihood of finding overlaps. This phenomenon is particularly problematic when tokenization is combined with a similarity algorithm. The issue arises from how tokenization generalizes code, making different files appear more similar than they actually are. This situation necessitates manual inspection by professors to differentiate legitimate overlap from cases where structural similarity is misleading. Consequently, tasks may never exhibit 0% average overlap, which can confuse users.

Effects of Removing Base-code

Removing base-code to determine real overlap between students involves tokenizing the original base code and subtracting the base code token count from overlapping code. While this approach approximates actual overlap and is easy to implement, it can introduce skewed results. The visualization shown below illustrates that if a student modified the base code, subtracting the base code tokens may remove more than intended. This underscores the challenges in accurately assessing overlap and highlights the necessity of manual inspection to avoid misinterpretation.

Base-code refers to code a professor may provide as part of an assignment, typically as a starting point for the students.
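As a minimal sketch of the subtraction described above (not the platform's exact implementation), the adjusted overlap can be approximated as the matched-token count minus the base-code token count, clamped at zero:

```typescript
// Approximate real overlap by subtracting tokens that stem from the provided base code.
// Clamping at zero covers the skew discussed above: a student who modified the base
// code may have fewer matched tokens than the base code itself.
function adjustedOverlap(matchedTokens: number, baseCodeTokens: number): number {
  return Math.max(0, matchedTokens - baseCodeTokens);
}

// Example: 120 matched tokens, of which 90 stem from base code -> 30 tokens of real overlap.
console.log(adjustedOverlap(120, 90)); // 30
```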

Presenting similarity

Visualizations reduce professors' workload by allowing them to engage with multiple submissions simultaneously. The HeatMap in particular aided professors, but usability issues arose: more legends and tooltips could enhance understanding. The color scheme of the progress bars influenced professors' focus, but a dynamic approach based on the overlap distribution is recommended. The side-by-side comparison view was well received but could benefit from a clearer color palette.

Cognitive workload

Cognitive workload in the context of plagiarism detection arises from the overwhelming task that professors face when evaluating a large number of student submissions to identify instances of plagiarism. During the experiment, professors struggled to retain a comprehensive understanding of every student's codebase, making it difficult to pinpoint potential cases of plagiarism. One of the key issues was the need for professors to actively inspect each submission, which could quickly become unwieldy and result in instances of plagiarism going unnoticed.

A submission prioritization system, in which the submissions with the highest potential overlap are presented first, aimed to help professors allocate their attention more efficiently. However, even with this strategy, the complexity of inspecting numerous submissions, each with varying degrees of overlap, remained a concern.
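A minimal sketch of such a prioritization, assuming each submission already carries its highest pairwise similarity, could look like this:

```typescript
// Order submissions so that the most suspicious cases surface first.
interface Submission {
  student: string;
  topOverlap: number; // highest similarity against any other submission
}

function prioritize(submissions: Submission[]): Submission[] {
  return [...submissions].sort((a, b) => b.topOverlap - a.topOverlap);
}
```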

To counter this, a marking system was implemented, allowing professors to flag suspicious cases for later review. This feature aimed to mitigate the burden of instant decision-making during the detection process and enabled professors to revisit and evaluate flagged submissions more conveniently.

Extensibility

The framework used in this project (Laravel) provides a service container implementation that greatly facilitates the adoption of Inversion of Control (IoC). This makes the platform extensible: each plagiarism algorithm is accompanied by a provider responsible for its registration. This becomes notably advantageous should external contributors later seek to incorporate new algorithms into the platform; they can develop a new package that implements the interfaces exposed by this project.

This approach also substantially reduces coupling and the amount of contextual knowledge integrators need. While some information about related models remains necessary for the integrator, the rest of the platform can be disregarded entirely. This not only streamlines the integration process but also enhances the platform's modularity and adaptability.
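The platform itself achieves this through Laravel's service container in PHP. Purely as a language-agnostic illustration of the kind of contract an external algorithm package would implement, here is a sketch in TypeScript with interface and method names of my own choosing, not the platform's actual API:

```typescript
// Illustrative contract for a pluggable detection algorithm; names are invented.
interface DetectionPair {
  submissionA: string;
  submissionB: string;
  similarity: number;
}

interface PlagiarismDetector {
  /** Human-readable name shown in the UI, e.g. "JPlag". */
  name(): string;
  /** Runs the underlying tool on a directory of submissions and returns parsed results. */
  detect(submissionDir: string): Promise<DetectionPair[]>;
}

// A registry plays the role of the service container here: providers register their
// detector once, and the rest of the platform only depends on the interface.
class DetectorRegistry {
  private detectors: PlagiarismDetector[] = [];

  register(detector: PlagiarismDetector): void {
    this.detectors.push(detector);
  }

  all(): PlagiarismDetector[] {
    return this.detectors;
  }
}
```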

Results

Through a qualitative study, this project aimed to understand how professors perceive and address plagiarism in programming courses. The study revealed diverse professor views on student plagiarism motivations, ranging from ignorance to a sense of impunity. Notably, obligatory assignments were only occasionally scrutinized by half of the participants, allowing plagiarizing students to evade detection, weakening deterrence.

Furthermore, the study highlighted limited adoption of automated plagiarism detection during exams and professors' perception of its ineffectiveness. This likely contributes to the restrained use of automated tools, particularly when integration is complex or cumbersome. The project explored visualization's potential in identifying patterns of problematic student behavior. Among eight assessed techniques, the HeatMap, Graph, and BoxPlots were integrated into the platform.

Addressing code plagiarism was complicated by shared code foundations from frameworks and libraries, which inflate similarity measures. This complexity burdened professors, while tools like JPlag were enhanced to account for base-code. Importantly, plagiarism tools cannot definitively determine plagiarism; professors retain the final decision.

Visualizations

Visualizations are significant because they condense a large number of comparisons into a small image. The HeatMap's ability to convey the similarity between files and students is unique and useful, as it enables the professor to spot not only which students are problematic but also which files could be problematic. However, the natural downside, especially for the HeatMap, is that it may not be immediately clear what it is trying to convey.

In larger tasks with 50-100 submissions, figuring out which students have worked together by manual inspection is virtually impossible. The graph especially shines in this regard; even if the professor is able to spot some students whose work looks similar, the clusterings in the graph make it almost instantly apparent which students may have worked together. Based on the feedback received in the qualitative experiment, professors find the visualizations useful as well.

Contributions to open source

During the integration of JPlag, some output data was absent. Specifically, JPlag would output the number of tokens matched in each compared file but not the total number of tokens in each file. By itself, the matched-token count means little, as the proportion of overlap between two files cannot be derived from it.

Since each tool's interpretation of a token varies, the total tokens in each file must be derived from JPlag's internal engine to ensure consistency. I have created a pull request that has been merged into the JPlag repository, such that the output JSON files now contain the number of tokens found in each file, which remedies this issue. Diving into the JPlag source code was surprisingly beneficial in getting a much deeper understanding of the inner workings of such a plagiarism detection tool.

Stack

Laravel
ApexChartsJs
JPlag
Cytoscape.js