Using word embeddings as a semantic method for plagiarism detection

Summary:

  • Word embeddings (Word2Vec, GloVe, FastText, BERT) detect plagiarism by capturing semantic similarities beyond exact word matches.
  • Embedding models enable effective identification of paraphrased and synonym-substituted content, outperforming traditional methods.
  • Contextual embeddings (e.g., BERT) provide superior accuracy but require higher computational resources compared to static models (Word2Vec, GloVe, FastText).
  • Hybrid embedding approaches combining static and contextual embeddings offer balanced solutions for practical semantic plagiarism detection.

Plagiarism is the unacknowledged reuse of someone else’s text, and it remains a serious challenge in academia and publishing. In the digital era, copying and rephrasing text has become easier than ever, so detecting plagiarism effectively is crucial. Traditional plagiarism detection methods typically rely on exact text matching or simple lexical metrics, such as comparing overlapping sequences of words or using bag-of-words representations. These approaches perform well for verbatim copying, but they often fail to catch disguised plagiarism, where the original text is paraphrased rather than copied word-for-word. A plagiarist may rewrite sentences by changing word order, inserting or deleting phrases, or substituting words with synonyms, producing a passage with the same meaning but different wording. Conventional tools based purely on lexical matching struggle in such cases because they do not understand the semantic equivalence between terms (Chang et al., 2021). Consequently, strongly paraphrased plagiarism, and plagiarism by translation, can evade detection by simple string matching.

To address this gap, researchers have turned to semantic similarity measures that go beyond surface text. In recent years, advances in natural language processing and machine learning have provided new ways to represent textual content. These new representations capture semantic relationships between words and sentences in a mathematical form. One powerful approach uses word embeddings to measure document similarity based on meaning rather than exact wording. Word embeddings are vector representations of words learned by neural network models. Word embedding models such as Word2Vec, GloVe, and FastText, as well as contextual models like BERT, encode words (and even whole sentences) as high-dimensional vectors. In these embedding spaces, semantically similar words lie close together. By leveraging these embeddings, a plagiarism detection system can identify content that is conceptually similar even when there are no exact matches. In this way, it can detect plagiarism involving paraphrasing or synonym substitution.

This article provides an overview of word embeddings and how they are applied to semantic plagiarism detection. We focus on extrinsic plagiarism detection – i.e., comparing a suspicious document to external sources. We also explore how each of the aforementioned embedding techniques can be used to uncover plagiarised text. We compare their performance and discuss implementation considerations. The goal is to demonstrate that embedding-based approaches enable much more robust plagiarism detection beyond lexical overlap. They improve our ability to catch concealed plagiarism while minimising false matches.

Word embeddings: capturing meaning in vectors

In order to detect semantic similarity, we first require a representation of text that encodes meaning. Word embeddings are dense vector representations of words which capture linguistic context and semantic relationships. Unlike a simple one-hot encoding (where each word is an independent dimension with no notion of similarity), word embeddings place words into a continuous vector space such that words used in similar contexts end up with similar vector representations. This idea is rooted in the distributional hypothesis. This hypothesis states that words used in similar contexts tend to have related meanings. Embedding models train on large corpora of text to learn these representations. Through this process, they associate each word with a vector that reflects its contextual usage patterns.

For example, in a well-trained embedding space, vectors for “doctor” and “physician” will be very close to each other, because these two terms occur in similar contexts and have almost interchangeable meanings. A polysemous word like “bank” is harder for static embeddings, which assign it a single vector; contextual models (introduced later) instead produce a representation closer to “river” in one sentence and closer to “finance” in another, showing how meaning can be captured when embeddings depend on context. Distances between word vectors (typically measured by cosine similarity) correlate with semantic similarity: words that are synonyms or related concepts have high cosine similarity even if they are lexically different. This property is precisely what plagiarism detectors need to catch paraphrased content. If a suspicious text replaces “city” with “metropolis”, a good embedding model will still recognise these words as similar and signal a potential match.
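
As a minimal sketch of this idea, the snippet below uses the gensim library and its downloadable pre-trained GloVe vectors (the “glove-wiki-gigaword-100” package, chosen here purely for convenience) to compare near-synonym pairs against an unrelated pair. The exact scores depend on the vectors used; the ordering is what matters.

```python
import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-100")

# Near-synonyms score far higher than unrelated word pairs.
print(vectors.similarity("doctor", "physician"))     # high
print(vectors.similarity("city", "metropolis"))      # high
print(vectors.similarity("city", "photosynthesis"))  # low
```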

Modern word embedding models are generally trained using neural networks or advanced matrix factorisation on massive datasets. The next sections introduce several influential embedding methods – Word2Vec, GloVe, FastText, and BERT – explaining how they work and how they differ. Understanding these models provides a foundation for applying them to plagiarism detection.

Word2Vec model

Word2Vec is a seminal neural embedding model introduced by Mikolov and colleagues at Google in 2013 (Mikolov et al., 2013). It initiated a revolution in NLP by showing how to efficiently learn high-quality word vectors from large corpora. Word2Vec comes in two main variants: CBOW (Continuous Bag-of-Words) and Skip-gram. Both are shallow neural networks that learn to predict linguistic context. In the CBOW architecture, the model learns to predict a target word given its surrounding context words. In the Skip-gram architecture (more popular in practice), the model does the reverse. It tries to predict the context words surrounding a given target word. As the model trains on billions of word sequences, it adjusts the internal vector representations of words. Words that appear in similar contexts end up with similar vectors.

The outcome of Word2Vec training is a learned vocabulary of words. Each word is associated with a vector typically of a few hundred dimensions (often about 300). These vectors encode meaningful semantic and syntactic relationships. For instance, Word2Vec famously demonstrated linear relationships like vector(“King”) – vector(“Man”) + vector(“Woman”) ≈ vector(“Queen”). This result illustrates that the embedding captured the concept of a gender analogy. More directly relevant to plagiarism detection, Word2Vec embeddings place synonymous or related words close together in space. This means that if a plagiarised passage replaces certain words with synonyms, a comparison of Word2Vec embeddings can still detect a high similarity. This is because the vectors of the original and substituted words align closely in the embedding space. Furthermore, Word2Vec’s training objective inherently captures some notion of topic. Words about finance, for example, cluster together and are distinct from words about healthcare. Therefore, even when a plagiarist paraphrases using different terminology, a Word2Vec-based comparison can reveal the similarity. Both texts still discuss the same underlying concepts.

Another advantage of Word2Vec is computational efficiency. It can be trained relatively quickly on large datasets. Applying it to new text is straightforward and fast, especially when using pre-trained word vectors. This made it feasible for plagiarism detection systems to incorporate semantic similarity scoring in real time. Typically, a plagiarism detection workflow using Word2Vec would proceed as follows. First, represent each document (or each sentence/paragraph) as an aggregate of its Word2Vec vectors. For example, one can take the average of all word vectors in the segment, or use more refined measures like the Word Mover’s Distance (discussed later). Then compute the cosine similarity between the vector representation of the suspicious text and that of a candidate source text. A high cosine similarity indicates that the two texts are semantically alike, even if they share few exact words. This suggests that one text may be a paraphrase of the other. In essence, Word2Vec provides the building blocks (word-level semantics) to enable this semantic comparison.
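
A minimal sketch of that workflow, assuming gensim and its pre-trained Google News Word2Vec vectors are available; the sentences and the implied decision threshold are illustrative only.

```python
import numpy as np
import gensim.downloader as api

# Pre-trained 300-dimensional Word2Vec vectors (large download on first use).
w2v = api.load("word2vec-google-news-300")

def sentence_vector(text, model):
    """Average the vectors of all in-vocabulary words in a text segment."""
    words = [w for w in text.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

original = "the committee rejected the proposal because funding was insufficient"
paraphrase = "the panel turned down the plan since money was inadequate"

# A high score despite little word overlap suggests possible paraphrased plagiarism.
print(cosine(sentence_vector(original, w2v), sentence_vector(paraphrase, w2v)))
```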

It is important to note that Word2Vec provides context-independent embeddings. Each word has a single vector regardless of the sentence it appears in. As a result, Word2Vec cannot inherently distinguish different senses of the same word. For example, “bank” as a financial institution and “bank” as a river edge share one vector representation. Despite this limitation, Word2Vec has proven highly useful for semantic plagiarism detection. It is especially effective when combined with techniques that consider the overall document context (Leong et al., 2018). It was one of the first tools to significantly improve detection of paraphrased plagiarism compared to naive keyword matching.

GloVe model

GloVe (Global Vectors for Word Representation) is another widely used word embedding model. It was developed by Pennington, Socher and Manning at Stanford in 2014 (Pennington et al., 2014). GloVe differs from Word2Vec in its training methodology. Instead of predicting context through a neural network, GloVe is a count-based model that leverages global word co-occurrence statistics. The core idea is to derive word vectors by factorising a matrix of co-occurrence counts. The dot product of two word vectors then reflects the probability that those two words co-occur in the corpus.

In practice, the GloVe algorithm constructs a large matrix of word–word co-occurrence frequencies. Each entry in this matrix corresponds to how often word i appears near word j in a large corpus. The algorithm then factorises this matrix (using a weighted least squares objective) to yield lower-dimensional vectors for each word.

The resulting GloVe vectors are comparable in quality to Word2Vec vectors. They capture similar semantic relationships and analogies. Words that often co-occur with the same other words will end up with similar embeddings. For example, GloVe will place “metropolitan” near “urban” and “city” in the vector space. This is because those words share similar co-occurrence patterns. A subtle advantage of GloVe’s global approach is that it can incorporate statistics about co-occurrence ratios. For example, it encodes relationships like “ice” is to “cold” as “fire” is to “hot” via appropriate vector differences.

In the context of plagiarism detection, GloVe can be used exactly the same way as Word2Vec. Often, the two are interchangeable for computing semantic similarity. A document vector could be obtained by averaging GloVe word vectors. Alternatively, one could compute cosine similarity or WMD using GloVe instead of Word2Vec.

From a plagiarism detection perspective, there is no fundamental difference between applying GloVe versus Word2Vec. Both produce static word embeddings that enable semantic comparisons. The choice may depend on practical considerations. For example, one might prefer whichever model has better pre-trained vectors available for the relevant language or domain.

In the literature, some studies note minor differences between the two. For instance, Word2Vec (skip-gram) can sometimes capture rare-word semantics better due to its sampling strategy for frequent versus infrequent words, whereas GloVe may better utilise global corpus statistics for common words. However, both are powerful at identifying paraphrases. If a suspicious text replaces “quickly” with “rapidly” or “large” with “huge”, both GloVe and Word2Vec embeddings would reflect these pairs as highly similar. Thus, a plagiarism checker using GloVe embeddings would flag the texts as semantically close despite the vocabulary change. Ultimately, GloVe reinforced the idea that any strong word embedding can serve as a basis for semantic plagiarism detection.
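
As a brief sketch (again assuming gensim’s downloadable GloVe vectors), the following reproduces both the analogy arithmetic and the synonym pairs mentioned above; with these particular vectors, “hot” should rank near the top of the analogy query, though exact results vary with the training corpus.

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

# Analogy arithmetic: "ice" is to "cold" as "fire" is to ... ?
print(glove.most_similar(positive=["fire", "cold"], negative=["ice"], topn=3))

# Synonym pairs relevant to paraphrase detection score highly.
print(glove.similarity("quickly", "rapidly"))
print(glove.similarity("large", "huge"))
```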

FastText embeddings

FastText is an extension of Word2Vec developed by researchers at Facebook AI (Bojanowski et al., 2017). Its crucial innovation is that, in addition to learning vectors for whole words, it also learns vectors for character n-grams (subword units). Each word’s embedding is essentially the sum of the vectors of all the character n-gram substrings that compose the word, together with a vector for the whole word itself.

For example, consider the word “international”. FastText would break this word into character n-grams like “int”, “nte”, “ter”, …, “nal”. It represents the word’s embedding as the combination of all these subword vectors.

This has two major benefits for semantic modeling. First, it naturally incorporates morphological information. Words that share roots or affixes have overlapping n-grams and thus similar representations. For example, “teach”, “teacher”, and “teaching” will influence each other’s vectors due to shared substrings. Second, it effectively handles out-of-vocabulary and rare words. Even if a word was not seen often (or at all) in training, FastText can approximate an embedding for it. It does this as long as the word’s constituent character sequences were seen, by summing those subword vectors.

For plagiarism detection, FastText’s properties are extremely useful. Plagiarists sometimes attempt to evade detection by introducing slight misspellings or creating new compound words to confuse exact matching. A static Word2Vec or GloVe model would fail on an unknown word, since it would have no vector for it. By contrast, FastText can generate a reasonable vector for a misspelled or novel word based on its character components. Similarly, if a rare technical term is paraphrased into an equally rare synonym, Word2Vec might not have good vectors for either term. FastText is better equipped in this scenario because it leverages subword information.
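
The sketch below trains a tiny gensim FastText model on a toy corpus purely to show the mechanism; a real system would instead load large pre-trained FastText vectors. The point is that a misspelled word absent from training still receives a vector assembled from its character n-grams.

```python
from gensim.models import FastText

# Toy corpus for illustration only; quality vectors need far more data.
corpus = [
    "anthropogenic emissions significantly influence climatology".split(),
    "human activity drives emissions of greenhouse gases".split(),
    "the climate of coastal cities is changing rapidly".split(),
    "scientists study climate change and its causes".split(),
]
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

# "climatolgy" (misspelled) never occurred in training, yet it still gets a
# vector built from its character n-grams and lands close to "climatology".
vec_typo = model.wv["climatolgy"]
print(model.wv.similarity("climatology", "climatolgy"))
```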

Indeed, studies have found FastText to outperform Word2Vec and GloVe on tasks of short-text semantic similarity and paraphrase detection (Chawla et al., 2021). This superior performance is likely due to FastText’s strengths with rare words and morphological variants. FastText retains the speed and simplicity of Word2Vec training (it uses a similar skip-gram objective under the hood). At the same time, it boosts robustness on infrequent words.

For example, consider an original text stating “anthropogenic emissions significantly influence climatology.” A plagiarist might change this to “man-made discharges greatly affect climate science,” altering many words. Traditional detectors might catch “influence/affect” or “climatology/climate” via stemming or synonyms, but they would probably miss the other changes. FastText, however, is better placed to relate the rarer terms: the subword “anthropo” ties “anthropogenic” to other human-related vocabulary seen during training, and subword decomposition yields usable vectors even for rare or hyphenated forms such as “man-made”. It will also see overlap between the “climatology” and “climate” vectors, since “climate” is a substring of “climatology” and the two share many character n-grams. As a result, a FastText-based document embedding will still show high similarity between the original and paraphrased sentence. This ability to leverage internal word structure makes FastText especially powerful for detecting plagiarism in domains with technical jargon, and when authors use obfuscation tricks like concatenating words or adding prefixes and suffixes.

BERT and contextual embeddings

Word2Vec, GloVe, and FastText all provide static embeddings, meaning each word has one vector regardless of context. However, modern NLP has moved toward contextual embeddings that generate vectors for words in context. The most prominent example is BERT (Bidirectional Encoder Representations from Transformers), introduced in 2018 (Devlin et al., 2019). BERT is a deep transformer-based language model, pre-trained on massive text corpora using self-supervised objectives (masked language modeling and next-sentence prediction). This training results in a model that can produce rich embeddings for entire sentences, in which each word’s representation is influenced by the words around it.

In fact, BERT provides a special embedding for the whole sequence (commonly the [CLS] token output). This vector can be used as a representation of the entire sentence or document.

The key advantage of BERT for plagiarism detection is its deep contextual understanding. BERT’s vectors are context-sensitive, so it can distinguish different meanings of the same word based on usage. It also captures subtle nuances of syntax and semantics in a sentence. For instance, consider the word “bank” in “the bank was flooded after the storm” versus “the bank approved my loan.” A static embedding would represent “bank” the same in both sentences, which could be misleading. BERT, however, will produce different vectors for “bank” in each sentence. It aligns the first with concepts of water (river bank) and the second with finance. This context sensitivity reduces false similarities and false differences that static models might produce.
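
A minimal sketch with the Hugging Face transformers library and the standard bert-base-uncased checkpoint, showing that the contextual vector for “bank” shifts with its sense; the specific similarity values will vary, but the two financial uses should typically score closer to each other than to the river use.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual BERT embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("the bank was flooded after the storm")
v_loan = bank_vector("the bank approved my loan")
v_rates = bank_vector("the bank raised its interest rates")

cos = torch.nn.functional.cosine_similarity
print(cos(v_river, v_loan, dim=0).item())   # typically lower: different senses of "bank"
print(cos(v_loan, v_rates, dim=0).item())   # typically higher: same financial sense
```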

In a plagiarism scenario, this means that BERT can gauge more accurately whether two passages convey the same idea. It remains effective even if complex paraphrasing has altered the structure or if polysemous words have been used differently.

Applying BERT to plagiarism detection often involves using it as a backbone for a sentence or document similarity model. One straightforward approach is to use BERT to encode each sentence of the suspicious document and each sentence of a candidate source document. Then one can compute cosine similarities between all sentence pairs to find potential matches. BERT’s embeddings are very informative. Thus, a truly paraphrased sentence will usually show a high similarity score to its original counterpart, even if few words overlap. More advanced approaches fine-tune BERT specifically for paraphrase identification or plagiarism detection. They train on pairs of texts labeled as plagiarised or not plagiarised. Such fine-tuning (for example, using a Siamese network or a classification head on BERT) can significantly improve accuracy. BERT learns the exact notion of “same content” vs “different content” through this process (Reimers and Gurevych, 2019).
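
A minimal sentence-level comparison using the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint is one convenient, lightweight choice, and any Sentence-BERT model could be substituted. The example sentences are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

source_sents = [
    "Economic growth slowed down in the final quarter.",
    "The committee rejected the proposal outright.",
]
suspicious_sents = [
    "The expansion of the economy decelerated in the last quarter.",
    "Rainfall patterns have shifted over the past decade.",
]

src_emb = model.encode(source_sents, convert_to_tensor=True)
sus_emb = model.encode(suspicious_sents, convert_to_tensor=True)

# Cosine similarity between every suspicious sentence and every source sentence;
# the paraphrased pair should score far above the unrelated pairs.
scores = util.cos_sim(sus_emb, src_emb)
print(scores)
```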

Indeed, recent research integrating BERT into plagiarism detection pipelines has reported impressive results. For example, Latina et al. (2024) developed a hybrid model that combined Word2Vec and BERT embeddings with other NLP features. This model achieved over 93% accuracy in detecting paraphrased plagiarism. Similarly, Moravvej et al. (2022) showed that a BERT-based approach outperformed earlier methods (like LSTM networks or static embeddings) on benchmark datasets. These results underscore that BERT’s deep semantic representations can reliably catch paraphrases that elude simpler techniques.

BERT’s power does come with a computational cost. BERT is a large model. It is slower and more memory-intensive to use than Word2Vec, GloVe, or FastText, especially if many pairwise comparisons are required. For systems that must scan millions of documents or operate in real time, using BERT naively might be impractical. Solutions to this include using distilled versions of BERT (like DistilBERT), which are smaller and faster. Another solution is to restrict BERT-based analysis to promising candidate pairs that have been pre-filtered by a lighter-weight method.

As hardware and optimisation techniques advance, contextual models like BERT are becoming increasingly viable to deploy at scale. Their ability to grasp near-complete semantic equivalence between differently worded texts makes them a cornerstone of the state-of-the-art in plagiarism detection.

Semantic plagiarism detection using embeddings

Having introduced the core embedding models, we now turn to how these representations are used in practice to detect plagiarism. The overarching idea is to measure the semantic similarity between pieces of text using their vector representations. There are two broad approaches: unsupervised similarity computation and supervised classification. Many plagiarism detection systems combine both, using unsupervised methods to narrow down candidates and then applying a trained model for final decisions. Below, we outline typical methodologies for embedding-based plagiarism detection.

Representing documents and computing similarity

The first step is to obtain vector representations for the texts under comparison. For example, you would encode the suspicious document and each document in the reference corpus as vectors. If working at the document level, one might derive a single vector for each whole document. Alternatively, for finer-grained detection, one can derive vectors for smaller segments (paragraphs or sentences) and then compare those to pinpoint specific plagiarised passages. The representation strategy depends on the embedding model:

Using static embeddings (Word2Vec, GloVe, FastText):

A common approach is to compute the average (or another aggregate such as sum) of all word vectors in the text segment. This yields a fixed-length vector representing the entire segment. The averaging approach is simple and ignores word order. However, it often suffices to capture the topic or overall semantic content of the segment. Two texts that share meaning will have similar averaged vectors even if they use different words. For example, averaging word vectors for “A quick brown fox jumps over the lazy dog” produces a vector not too far from that for “A speedy auburn fox leaped over a sleepy canine,” because the two sentences contain similar sets of words. Other variants include using a weighted average (e.g., weighting by TF-IDF to give more importance to significant words). Instead of averaging, one could also train a Doc2Vec model (Le and Mikolov, 2014) which directly learns an embedding for entire documents. Doc2Vec was designed to capture document-level topics beyond individual words, and it has been applied to plagiarism detection as well (Chawla et al., 2021). Once each document or segment is embedded as a vector, the similarity between two pieces of text can be quantified by the cosine similarity between their vectors (or a distance metric like Euclidean distance). Cosine similarity is most commonly used due to its interpretability (identical texts give cosine similarity ~1.0, unrelated texts near 0). A high cosine similarity indicates that the texts lie close together in semantic space, suggesting potential plagiarism if one text was expected to be original.
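
The following sketch illustrates the TF-IDF-weighted averaging variant: each word vector is weighted by its inverse document frequency before averaging, so distinctive words dominate the document vector. The corpus, vectors, and weighting scheme are illustrative assumptions.

```python
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

vectors = api.load("glove-wiki-gigaword-100")

docs = [
    "a quick brown fox jumps over the lazy dog",
    "a speedy auburn fox leaped over a sleepy canine",
]

# Fit TF-IDF on the collection so that distinctive words receive higher weight.
tfidf = TfidfVectorizer()
tfidf.fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_doc_vector(text):
    """IDF-weighted average of the word vectors in a document."""
    pairs = [(w, idf.get(w, 1.0)) for w in text.split() if w in vectors]
    weights = np.array([wt for _, wt in pairs])
    mat = np.array([vectors[w] for w, _ in pairs])
    return (mat * weights[:, None]).sum(axis=0) / weights.sum()

a, b = weighted_doc_vector(docs[0]), weighted_doc_vector(docs[1])
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
```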

Word Mover’s Distance (WMD):

Rather than collapsing a document into a single average vector, one can use the Word Mover’s Distance. This innovative metric directly computes a distance between two texts by considering the distances between individual word embeddings (Kusner et al., 2015). WMD treats one document as a “bag” of embedded words and asks: how much “travel distance” is required to move the words of document A to the locations of the words of document B in embedding space? It’s essentially the Earth Mover’s Distance applied to word distributions, using distances between Word2Vec embeddings as the cost. If two documents are very similar in meaning, WMD will find a low-cost alignment. In such a case, most words in document A can be matched to semantically close words in document B, so only a small “moving” distance is needed. If documents differ in topic, words must travel further in semantic space to match up, yielding a larger distance. The advantage of WMD is its granularity. It inherently finds which words or concepts in one text correspond to those in the other, even if they are synonyms or rephrasings. This makes it highly effective for comparing short texts like individual sentences or paragraphs that have been paraphrased. For instance, consider “economic growth slowed down” and “the expansion of the economy decelerated.” WMD would pair “economic” with “economy”, “growth” with “expansion”, and “slowed down” with “decelerated”. It would then compute a very small overall distance because each pair of words is very close in embedding space. Empirical evaluations show that WMD is extremely effective for detecting paraphrases. However, it is more computationally heavy than simple averaging or cosine, since it involves solving an optimisation (a linear programming problem). In large-scale systems, WMD is often applied only to a subset of likely candidate pairs. Nonetheless, it’s a valuable tool when high accuracy semantic matching is needed.
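
gensim exposes this metric directly on its keyed vectors as wmdistance (recent versions require an optimal-transport backend such as the POT package to be installed). A small sketch, reusing the example above:

```python
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")

doc_a = "economic growth slowed down".split()
doc_b = "the expansion of the economy decelerated".split()
doc_c = "the hospital hired additional nursing staff".split()

# Lower distance means the two word sets are easier to align in embedding space.
print(w2v.wmdistance(doc_a, doc_b))  # small: paraphrase of the same statement
print(w2v.wmdistance(doc_a, doc_c))  # larger: different topic
```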

Using contextual embeddings (BERT):

With models like BERT, one can obtain a vector for an entire sentence or paragraph directly. For example, one might use the final-layer [CLS] token representation or average all token embeddings from BERT’s output for a segment. Specialised models like Sentence-BERT (Reimers and Gurevych, 2019) fine-tune BERT to produce particularly meaningful sentence vectors that can be compared via cosine similarity. Given BERT-derived vectors for text A and text B, one can again use cosine similarity as the similarity measure. A threshold might be set such that if cosine similarity exceeds, say, 0.85, the texts are flagged as likely paraphrases. BERT can also be used in a pairwise fashion by feeding both texts into the model together and having it output a classification of “same content” vs “different content.” This pairwise approach often yields even higher accuracy, because the model can directly align the meanings of the two inputs. However, it requires more computation (each pair must be run through BERT) and training data of plagiarism vs non-plagiarism pairs. In summary, contextual embeddings provide very accurate similarity measures. They capture not only word-level synonymy, but also whether two sentences as a whole are paraphrases, considering word order, syntax, and context.

Other semantic features:

In addition to embedding comparisons, plagiarism detectors may use related semantic features. For example, a thesaurus or ontology can complement embeddings by explicitly mapping rare synonyms. Some systems compute semantic n-grams or concept fingerprints — sequences of embeddings or concepts that appear in both texts. Another strategy is clustering words into high-level concepts (Chang et al., 2021). Chang and colleagues clustered Word2Vec embeddings into “semantic concepts” using spherical k-means. Each document is then represented as a vector of concept frequencies rather than raw word frequencies. Two documents can be compared by their concept vectors. This approach is more robust to vocabulary differences, because words in the same concept cluster are treated interchangeably. Chang et al. demonstrated that such concept-based representation significantly improved detection of heavily disguised plagiarism. These techniques essentially build on core embeddings to derive features more tailored to plagiarism detection.
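
A rough sketch of the concept-clustering idea follows; ordinary k-means on length-normalised GloVe vectors stands in here for the spherical k-means used by Chang et al., and the vocabulary size and cluster count are arbitrary illustrative choices.

```python
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

vectors = api.load("glove-wiki-gigaword-100")
vocab = list(vectors.key_to_index)[:20000]          # restrict vocabulary for speed
mat = normalize(np.array([vectors[w] for w in vocab]))

# Cluster word vectors into "concepts"; synonyms often land in the same cluster.
kmeans = KMeans(n_clusters=500, n_init=4, random_state=0).fit(mat)
concept_of = dict(zip(vocab, kmeans.labels_))

def concept_vector(text, k=500):
    """Represent a document as a histogram of concept-cluster frequencies."""
    hist = np.zeros(k)
    for w in text.lower().split():
        if w in concept_of:
            hist[concept_of[w]] += 1
    return hist / max(hist.sum(), 1)

a = concept_vector("the physician treated the patient")
b = concept_vector("the doctor cared for the sick person")
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)))
```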

Once a similarity score or feature set is obtained for a document pair, the system must decide if it indicates plagiarism. The simplest approach is to set a threshold. For instance, if the cosine similarity between two paragraphs is above 0.9, the system marks them as a match. In practice, setting the threshold can be tricky. If it is too low, you get false positives (unrelated texts on the same topic might score moderately high). If it is too high, you may miss some paraphrases. It helps to impose a minimum match length to avoid flagging short common phrases. It’s also useful to consider the portion of text involved (e.g., require that at least 50% of one text’s content is covered by the similarity). More sophisticated systems use machine learning classifiers that take multiple inputs. For example, they may feed in the similarity score, the portion of text matched, and some lexical overlap metrics to produce a final decision. Regardless of specifics, embedding-based methods can catch cases where large portions of a text have been plagiarised in disguise. The suspicious document will “light up” as unusually similar in content to a source document, even if it shares few exact strings.
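
One possible aggregation rule, with purely illustrative thresholds, is sketched below; the matrix of sentence-pair similarities is assumed to come from one of the embedding methods described above.

```python
import numpy as np

def flag_plagiarism(sim_matrix, sent_threshold=0.85, coverage_threshold=0.5):
    """Decide whether a suspicious document should be flagged.

    sim_matrix[i, j] is the cosine similarity between suspicious sentence i
    and source sentence j (e.g., from Sentence-BERT embeddings).
    """
    best_per_sentence = sim_matrix.max(axis=1)   # best source match for each suspicious sentence
    matched = best_per_sentence >= sent_threshold
    coverage = matched.mean()                    # fraction of sentences with a strong match
    return coverage >= coverage_threshold, coverage, np.where(matched)[0]

# Example: 4 suspicious sentences compared against 3 source sentences.
sims = np.array([
    [0.91, 0.40, 0.32],
    [0.88, 0.35, 0.30],
    [0.42, 0.37, 0.29],
    [0.90, 0.41, 0.33],
])
print(flag_plagiarism(sims))   # flagged: 3 of 4 sentences strongly match the source
```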

It is often best to combine semantic analysis with traditional methods for optimal performance. For example, one efficient approach is to first perform a conventional n-gram search to retrieve a small set of candidate source documents (filtering out obviously unrelated material). Then, apply the embedding-based semantic comparison only between the suspicious document and those candidates. This two-stage approach is used in many modern plagiarism detection frameworks (Potthast et al., 2014). It leverages the speed of lexical matching to narrow comparisons, and then the power of embeddings for nuanced judgment. Hybrid metrics have also been proposed. For example, some methods combine longest common subsequence matching with word embedding similarities (Sánchez-Perez et al., 2014) to capture both structural and semantic resemblances. Overall, word embeddings greatly strengthen plagiarism detection, but they work best when integrated thoughtfully into a pipeline that accounts for practical constraints.

Example workflow

To illustrate a semantic plagiarism detection system using word embeddings, consider this high-level workflow (a condensed code sketch follows the list):

  1. Preprocessing: Normalise the texts by lowercasing, removing stopwords (optional), and splitting into meaningful units (e.g., sentences or paragraphs) for comparison. Some approaches also lemmatise words. However, with modern embeddings this is often not necessary, since the embeddings handle different word forms.
  2. Embedding representation: Choose a pre-trained embedding model (e.g., FastText Common Crawl vectors or a BERT-base model) that suits your language and domain. Use this model to obtain vector representations for each unit of text. For static embeddings, compute an average or TF-IDF weighted average of word vectors for each sentence/paragraph. For BERT, take the [CLS] token vector from the last layer as the sentence embedding.
  3. Initial candidate selection: If you have a large collection of source documents, first narrow the search. For each suspicious document, compare some global features (like the average embedding of the whole document) against all source documents using a fast nearest-neighbour search. Alternatively, use keyword overlap or hashing to shortlist likely candidates, since a semantic comparison for every pair is expensive.
  4. Pairwise semantic comparison: For each suspicious document and each candidate source, compute a similarity score for each segment pair. For example, calculate cosine similarity between every sentence of the suspicious document and every sentence of the source. Identify the maximum similarity values or any pair above a chosen threshold. If using WMD, compute the distance between the documents or between relevant segments. If a supervised model is available (e.g., a fine-tuned BERT classifier), utilise it. Run each suspicious–source segment pair through the model to get a probability of plagiarism.
  5. Decision logic: Aggregate the evidence from the pairwise comparisons. If any segment pairs exceed a high similarity threshold (e.g., cosine > 0.9) or if a large portion of the document shows high similarity, flag the source document as containing plagiarised content. Highlight the specific matching segments. This step may involve rules like “at least N words or M% of the text is similar” to avoid trivial matches. It may also involve averaging the top-k similarity scores to produce an overall document-level score.
  6. Output and verification: Present the results, usually as potential plagiarism “matches” linking the suspicious document to one or more sources. Include excerpts of the similar text highlighted for context. A human examiner can then verify whether it indeed looks like plagiarism. These semantic methods greatly reduce the manual effort by filtering out unlikely cases and bringing true paraphrases to attention.
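
The condensed sketch below combines stages 3–5: a cheap TF-IDF shortlist of candidate sources followed by Sentence-BERT sentence-level comparison. Model names, thresholds, and the naive full-stop sentence splitting are all illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util

def detect(suspicious_doc, source_docs, n_candidates=5, threshold=0.85):
    """Two-stage sketch: lexical shortlist, then semantic sentence comparison."""
    # Stage 1: cheap lexical filter to shortlist candidate source documents.
    tfidf = TfidfVectorizer().fit([suspicious_doc] + source_docs)
    sims = cosine_similarity(tfidf.transform([suspicious_doc]),
                             tfidf.transform(source_docs))[0]
    candidates = np.argsort(sims)[::-1][:n_candidates]

    # Stage 2: semantic sentence-level comparison with Sentence-BERT.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sus_sents = [s.strip() for s in suspicious_doc.split(".") if s.strip()]
    sus_emb = model.encode(sus_sents, convert_to_tensor=True)

    matches = []
    for idx in candidates:
        src_sents = [s.strip() for s in source_docs[idx].split(".") if s.strip()]
        src_emb = model.encode(src_sents, convert_to_tensor=True)
        scores = util.cos_sim(sus_emb, src_emb)
        for i in range(len(sus_sents)):
            j = int(scores[i].argmax())
            if float(scores[i][j]) >= threshold:
                matches.append((idx, sus_sents[i], src_sents[j], float(scores[i][j])))
    return matches

# Usage: matches = detect(suspicious_text, list_of_source_texts)
# Each match records the candidate source index, the suspicious sentence,
# the best-matching source sentence, and the similarity score.
```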

When implemented, such a workflow can catch cases of plagiarism that would slip past a traditional checker. For example, a student assignment might contain a paragraph whose vector representation is extremely close to that of a paragraph in an online article, even though none of the sentences are identical. The system would flag this. Upon inspection, one may find that the student used different wording but conveyed the same ideas in the same order – clear evidence of plagiarism. In tests on datasets of manually paraphrased plagiarism, embedding-based methods often show dramatic gains in recall (catching many more plagiarised cases) with only a minor hit to precision (rarely flagging unrelated texts).

Comparison of embedding models for plagiarism detection

Each embedding technique – Word2Vec, GloVe, FastText, and BERT – brings unique strengths to plagiarism detection. Below we compare them in terms of effectiveness, resource requirements, and practical considerations:

Word2Vec:

This model is lightweight and fast, and it captures general semantics well, enabling detection of paraphrased or synonym-substituted text at relatively low computational cost. Pre-trained Word2Vec embeddings (e.g., Google’s News vectors) are readily available and easy to integrate. The downside is that Word2Vec cannot handle out-of-vocabulary words and cannot differentiate context-specific meanings. Still, it provides a strong baseline for semantic similarity. Surveys of plagiarism detection had long argued for semantics-aware features to catch disguised plagiarism (Alzahrani et al., 2012), and early embedding-based approaches built on Word2Vec reported significantly improved recall of disguised plagiarism compared to purely lexical methods. Word2Vec tends to perform reliably for normal prose where most words are common and used in their typical senses. It may struggle if the plagiarism involves highly domain-specific jargon or ambiguous wordplay that changes meaning with context.

GloVe:

GloVe is very similar to Word2Vec in plagiarism detection performance, and in practice the two are often interchangeable components. Where differences have been observed, they are minor. For example, one dataset showed Word2Vec with a slight edge while another favoured GloVe, likely due to differences in training corpora rather than the algorithms themselves. GloVe vectors (such as the Common Crawl 840B-token set) are widely available. Choosing between GloVe and Word2Vec often comes down to convenience or domain fit. There is no strong evidence that one consistently outperforms the other in catching plagiarism. Both would fail on similar challenges (like a completely unseen term) and succeed on similar ones (catching synonyms, etc.). One could experiment with both, but given their conceptual overlap, usually one is sufficient.

FastText:

FastText often outperforms Word2Vec and GloVe on tasks like short-text similarity and paraphrase detection (Chawla et al., 2021). Its ability to handle rare words and encode character-level nuances gives it an edge. For instance, FastText can recognise that “dopamine” and “dopaminergic” are related terms in a neuroscience text, whereas Word2Vec might not have quality vectors for those without specialised training. Chawla et al. (2021) found FastText to be the most effective model (with highest accuracy and F1) for identifying paraphrases in plagiarised student answers, while also being efficient in memory and speed. FastText’s resource overhead is only slightly higher than Word2Vec (due to storing subword vectors and summing them), and this is usually not problematic. FastText is a strong choice if your plagiarism detector must handle diverse or specialised vocabulary. It is also useful in cross-lingual plagiarism contexts, since subword information can bridge languages with common roots or loanwords (though true cross-language detection often needs multilingual embeddings or translation).

BERT:

BERT and similar transformers represent the state-of-the-art for capturing meaning. They can identify matching content with a degree of precision and recall that static embeddings alone cannot reach. BERT-based methods have near-human performance on paraphrase detection. If two passages mean the same but are written differently, BERT’s embeddings can often recognise this equivalence, whereas even FastText with cosine similarity might falter if wording differs greatly. BERT’s contextual understanding also means it is less likely to be fooled by texts that merely share topic words. Suppose two students independently write about “global warming.” They will both use words like “climate,” “temperature,” and “emissions” frequently. A simple embedding average could show high similarity in this case, even if one student did not plagiarise the other. BERT, by considering full sentences and discourse, might discern that the arguments or phrasing differ, thus reducing false positives. The major limitation of BERT is efficiency. Running BERT on every sentence pair in a large collection is computationally expensive. In development, one should use strategies to mitigate this: reduce the search space with cheaper methods, use smaller models (e.g., DistilBERT), cache embeddings of common sentences, or fine-tune BERT to produce document-level embeddings (for example, hierarchically combining sentence embeddings). Despite these challenges, when maximum accuracy is required, BERT is the go-to choice. In evaluations on benchmarks, BERT-based detectors consistently outperform earlier methods by significant margins (Moravvej et al., 2023). Researchers are also finding ways to integrate transformers more directly into search, for instance by using attention to automatically highlight plagiarised phrases.

Hybrid approaches:

Some systems don’t rely on a single method. For example, Latina et al. (2024) built a hybrid system combining Word2Vec and BERT features. The idea is that Word2Vec (or FastText) can quickly capture obvious similarities, while BERT can dive deeper into subtle ones; together they yield a robust decision. Another approach is to mix semantic and style-based features. One might use embeddings to find content matches and simultaneously check if the writing style suddenly changes (e.g., vocabulary or sentence structure shifts), which can signal inserted plagiarised material. However, blending style analysis with content similarity is complex. Still, in cases of a very skilful paraphraser, if semantic models are unsure, a stylistic discrepancy might raise a flag.

In summary, all these embedding methods significantly elevate plagiarism detection by enabling semantic comparisons. Introducing embeddings can boost recall of paraphrased plagiarism by a large margin (Alzahrani et al., 2012; Chawla et al., 2021) because cases missed by keyword matching are now caught. FastText and BERT stand out: FastText for its efficiency and strong performance, BERT for its top-tier accuracy (given enough resources). A practical system might use FastText for initial scanning and BERT for verifying borderline or high-stakes cases. Overall, plagiarism detection is moving from simple string matching to sophisticated semantic analysis powered by embeddings.

Conclusion

Word embeddings have transformed plagiarism detection by enabling systems to recognise when two texts convey the same ideas, even if written in different words. This semantic approach addresses the core weakness of traditional plagiarism checkers, which could be easily evaded by moderate paraphrasing or synonym replacement. Techniques like Word2Vec, GloVe, and FastText provide a foundation for representing text meaning in a mathematical form. They have proven their worth in identifying paraphrased or obfuscated copies of documents. Meanwhile, contextual models like BERT push the boundaries further, capturing deep semantic and contextual similarities that allow near human-level detection of copied content in disguise.

By incorporating these models, modern plagiarism detectors can uphold academic and scientific integrity more effectively. They not only catch blatant copy-paste plagiarism but also more insidious forms where someone rewrites another’s work without proper attribution. Moreover, these embedding methods are continually improving. We can expect future systems to use even more refined embeddings (perhaps from GPT-style models or domain-specific transformers) to detect plagiarism across languages and even catch translated plagiarism. They might also identify when the flow of ideas in a text is suspiciously similar to another source (idea plagiarism). Challenges remain – computational cost, the risk of false positives on topically similar yet independent texts, and the need for explainability. However, research is addressing these issues by combining semantic analysis with additional checks and optimising algorithms for speed.

In conclusion, word embeddings represent a significant leap forward in plagiarism detection technology. They bring us closer to the ideal of catching plagiarism not through shallow cues, but by truly understanding textual content. As these methods become widespread, would-be plagiarists will find it increasingly difficult to hide behind rephrasing. The arms race between plagiarism tactics and detection methods will continue, but semantic embedding-based detection provides a robust defence of originality. In academia and beyond, this helps ensure that authors who put in genuine effort receive the credit they deserve, and that copied ideas (no matter how cleverly concealed) are brought to light.

References

Alzahrani, S.M., Salim, N. and Abraham, A. (2012) ‘Understanding plagiarism linguistic patterns, textual features, and detection methods’, IEEE Transactions on Systems, Man, and Cybernetics, 42(2), pp. 133–149.

Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2017) ‘Enriching word vectors with subword information’, Transactions of the Association for Computational Linguistics, 5, pp. 135–146.

Chang, C.-Y., Lee, S.-J., Wu, C.-H. and Liu, C.-K. (2021) ‘Using word semantic concepts for plagiarism detection in text documents’, Information Retrieval Journal, 24(3), pp. 298–321. DOI: 10.1007/s10791-021-09394-4

Chawla, S., Aggarwal, P. and Kaur, R. (2021) ‘Comparative analysis of semantic similarity word embedding techniques for paraphrase detection’, EasyChair Preprint 5833.

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019) ‘BERT: Pre-training of deep bidirectional transformers for language understanding’, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 4171–4186.

Kusner, M., Sun, Y., Kolkin, N. and Weinberger, K. (2015) ‘From word embeddings to document distances’, in Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 957–966.

Latina, J.V., Vallejo, E.D.M., Cabalsi, G.M., Centeno, C.J., Sanchez, J.R. and Garcia, E.A. (2024) ‘Utilization of NLP techniques in plagiarism detection system through semantic analysis using Word2Vec and BERT’, in Proceedings of the 2024 International Conference on Expert Clouds and Applications (ICOECA). IEEE. DOI: 10.1109/ICOECA62351.2024.00068

Le, Q. and Mikolov, T. (2014) ‘Distributed representations of sentences and documents’, in Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1188–1196.

Pennington, J., Socher, R. and Manning, C.D. (2014) ‘GloVe: Global vectors for word representation’, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.

Reimers, N. and Gurevych, I. (2019) ‘Sentence-BERT: Sentence embeddings using Siamese BERT-networks’, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 3982–3992.
