Plagiarism is widely considered the most frequent form of academic misconduct, yet it is a complex matter. Among established cases of academic misconduct, plagiarism arguably figures less frequently than falsification or fabrication, the two other forms of academic misconduct. This discrepancy has to do with the complexity of proving plagiarism. Detecting possible plagiarism requires identifying a potential "source text." Text comparison, however, is a complex affair, particularly because authors can manipulate "borrowed" texts without copying them verbatim. Further problems arise in defining plagiarism and in determining which uses of sources are improper.
Finding Common Text Clusters
Plagiarism detection relies on finding textual overlaps or a collection of common "text clusters." Such detection typically consists of three steps:
Step 1: Establish a corpus of comparison materials
Text plagiarism entails, by definition, the existence and improper use of a source text. Detection therefore requires a growing database of writings against which a specific document can later be checked. Four approaches, which can be combined, serve this purpose:
- The Locus-Oriented Approach
- The Reference-Oriented Approach
- The Subject-Oriented Approach
- The "Track-Covering"-Oriented Approach
Step 2: Classify potential source documents according to the probability of relatedness
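One way to picture this step is to score every document in the comparison corpus against the suspect document and sort by that score. The sketch below is only an illustration: the text does not prescribe a measure, so the Jaccard similarity of word three-grams ("shingles") is assumed here as a rough proxy for the probability of relatedness, and the names `shingles`, `jaccard`, and `rank_candidates`, as well as the sample texts, are hypothetical.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(suspect, corpus, n=3):
    """Sort corpus entries (name -> text) by descending similarity to suspect.

    Assumption: shingle overlap stands in for "probability of relatedness".
    """
    target = shingles(suspect, n)
    scores = {name: jaccard(target, shingles(text, n))
              for name, text in corpus.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-document corpus for illustration only.
corpus = {
    "doc_a": "the quick brown fox jumps over the lazy dog",
    "doc_b": "an entirely unrelated sentence about databases",
}
ranking = rank_candidates("a quick brown fox jumps over a sleeping dog", corpus)
```

Documents at the top of `ranking` would be examined first in Step 3. A set-based score of this kind ignores word order beyond the shingle length, so it is best read as a coarse filter rather than as evidence of copying.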
Step 3: Identify, quantify, and document common text clusters between two documents
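The three verbs in this step can be illustrated with a short sketch: identify runs of consecutive words shared by two documents, quantify them as a share of the suspect text, and document each run with its position in both texts. This is a minimal example under stated assumptions, not a description of any particular detection tool: it uses the standard-library `difflib.SequenceMatcher` on word tokens, and the function name `common_clusters`, the `min_words` threshold, and the sample sentences are hypothetical.

```python
import re
from difflib import SequenceMatcher


def common_clusters(suspect, source, min_words=4):
    """Find runs of at least min_words consecutive words shared by both texts.

    Returns (clusters, share), where share is the fraction of the suspect
    text's words that fall inside a reported cluster.
    """
    a = re.findall(r"\w+", suspect.lower())
    b = re.findall(r"\w+", source.lower())
    # autojunk=False keeps frequent words from being dropped in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    clusters = [
        {
            "suspect_pos": m.a,   # word index in the suspect text
            "source_pos": m.b,    # word index in the source text
            "words": m.size,      # length of the shared run
            "text": " ".join(a[m.a:m.a + m.size]),
        }
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]
    covered = sum(c["words"] for c in clusters)
    share = covered / len(a) if a else 0.0
    return clusters, share


# Hypothetical sample texts for illustration only.
suspect = "Detection relies on finding textual overlaps between a document and its source."
source = "Plagiarism detection relies on finding textual overlaps or common clusters."
clusters, share = common_clusters(suspect, source)
```

Here `clusters` holds one six-word run and `share` is 0.5. Because `SequenceMatcher` aligns the two texts in order, passages that were reordered or paraphrased would be missed; handling the manipulated "borrowed" text described earlier would require a more tolerant comparison.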