Pairwise Image Matching for Plagiarism Detection

Daniil Dorin¹ 📧, Kseniia Varlamova¹, Andrey Grabovoy¹

¹ Antiplagiat Company, Moscow, Russia

📧 Corresponding author

This work addresses the critical problem of detecting near-duplicate images in scientific publications, particularly in medical and biological research. The core challenge is to determine whether two given images are:

Near-duplicates: One image was derived from the other through manual transformations (true plagiarism), or
Merely similar: Two distinct images sharing visual content but not derived from the same source.

Why This Matters

Traditional image retrieval systems often fail to distinguish between these two scenarios. While content-based image retrieval can identify visually similar candidates, it cannot determine if one image was manually manipulated from another, a crucial distinction for plagiarism detection in academic publishing.

Near-Duplicate Transformations

Our system detects images that have undergone common manual manipulations, including:

Rotations and mirroring
Grayscale conversion
Contrast adjustments
Cropping and resizing
Blurring and noise addition
Combinations of these transformations
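
To make the transformation list concrete, here is a minimal, dependency-free sketch of a few of these manipulations and of how they can be chained. It is an illustration only: the nested-list image representation and the function names (`mirror`, `rotate90`, `to_grayscale`, `crop`, `compose`) are assumptions made for this example and are not taken from the repository, where an image library would normally do this work.

```python
# Assumption for illustration: an image is a list of rows,
# and each pixel is an (R, G, B) tuple of integers in 0..255.

def mirror(img):
    """Flip the image horizontally."""
    return [list(reversed(row)) for row in img]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def to_grayscale(img):
    """Convert to grayscale with ITU-R BT.601 luma weights,
    keeping three identical channels per pixel."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out

def crop(img, top, left, height, width):
    """Keep a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def compose(*fns):
    """Chain transformations, applied left to right."""
    def apply(img):
        for fn in fns:
            img = fn(img)
        return img
    return apply

original = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(10, 20, 30), (40, 50, 60), (70, 80, 90)],
]

# A combined manipulation: mirror first, then convert to grayscale.
near_duplicate = compose(mirror, to_grayscale)(original)
```

Any output of `compose(...)` applied to `original` is, by construction, derived from it, which is exactly the relationship the system is meant to detect.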

The Classification Challenge

The key difficulty lies in differentiating between:

Class 1 (Near-duplicate): Images where one was derived from the other through manual manipulation (e.g., a grayscale version of the original).
Class 0 (Similar but distinct): Images that share visual content but are fundamentally different (e.g., two different cells under a microscope).
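
The two classes can be illustrated with a small sketch that builds labeled pairs. This is a hypothetical construction, not the authors' data pipeline: the helper `make_pairs`, the toy 2x2 "images", and the choice of mirroring as the only manipulation are all assumptions made for the example.

```python
import random

def mirror(img):
    """Flip a grayscale image (list of rows) horizontally."""
    return [list(reversed(row)) for row in img]

def make_pairs(images, transform, rng):
    """Return (first, second, label) triples.

    label 1: second was derived from first by `transform`
    label 0: second is a different source image
    """
    pairs = []
    for i, img in enumerate(images):
        # Class 1: a manipulated copy of the same source image.
        pairs.append((img, transform(img), 1))
        # Class 0: another image from the pool, never the same one.
        j = rng.choice([k for k in range(len(images)) if k != i])
        pairs.append((img, images[j], 0))
    return pairs

# Two toy "microscope" images: visually close, but distinct sources.
cell_a = [[12, 200], [180, 15]]
cell_b = [[14, 190], [175, 20]]

pairs = make_pairs([cell_a, cell_b], mirror, random.Random(0))
labels = [label for _, _, label in pairs]
```

The sketch shows why the task is hard: `cell_a` and `cell_b` are nearly identical pixel by pixel yet belong to Class 0, while a mirrored copy differs more at each pixel position yet belongs to Class 1.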

Unlike general image similarity tasks, our goal is not to measure visual resemblance but to detect whether one image was specifically derived from another through manual transformations, a critical distinction for plagiarism detection in scientific contexts.

Proposed Solution

We implement a Siamese neural network architecture with:

Various encoders, including EfficientNet-B3, ViT-L/16, CLIP ViT-H/14, and a Barlow Twins encoder with a ResNet50 backbone. Some encoders are kept frozen so that representations obtained from our training can be compared with those of the pretrained contrastive encoders.
A fusion module that employs a symmetric function to ensure invariance to input order, crucial for a stable scoring function.
A classification head that predicts the probability of a near-duplicate relationship. It is a Multi-Layer Perceptron (MLP) with a single hidden layer, ReLU activation, and dropout for regularization, ending in a sigmoid function that yields the similarity score.

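The system outputs a similarity score representing the probability that one image was derived from the other through manual manipulation, rather than simply sharing visual content. Because the fusion module is symmetric, the score does not depend on the order in which the two images are given.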
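
The fusion and classification steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the source does not say which symmetric function is used, so the concatenation of the absolute difference and the element-wise product is an assumed choice, and `fuse`, `make_head`, `head_forward`, and `score` are hypothetical names. The embeddings stand in for encoder outputs, and the weights are random and untrained.

```python
import math
import random

def fuse(a, b):
    """Assumed symmetric fusion: [|a - b|, a * b].
    Swapping a and b yields the same vector."""
    diff = [abs(x - y) for x, y in zip(a, b)]
    prod = [x * y for x, y in zip(a, b)]
    return diff + prod

def make_head(in_dim, hidden_dim, rng):
    """Random (untrained) weights for an MLP with one hidden layer."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2

def head_forward(params, z):
    """Linear -> ReLU -> Linear -> sigmoid.
    Dropout acts only during training, so it is omitted here."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def score(params, emb_a, emb_b):
    """Similarity score in (0, 1) for a pair of embeddings."""
    return head_forward(params, fuse(emb_a, emb_b))

# Placeholder embeddings standing in for encoder outputs.
emb_a = [0.2, -1.0, 0.7]
emb_b = [0.1, 0.4, 0.7]

params = make_head(in_dim=6, hidden_dim=4, rng=random.Random(0))
s_ab = score(params, emb_a, emb_b)
s_ba = score(params, emb_b, emb_a)
```

Since `fuse(a, b)` and `fuse(b, a)` produce identical vectors, `s_ab` and `s_ba` are equal, which is the order invariance the fusion module is meant to guarantee.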

This architecture is designed to distinguish near-duplicate images from merely similar ones, which is the distinction needed to detect image plagiarism in scientific publications.



DorinDaniil/Pairwise-Image-Matching
