Plagiarism in natural and programming languages


This report covers part of the dissertation work: how to improve the existing system for submitting student coursework. The application has several stages to develop. The first step is to design a useful website that makes it easy for students to submit their coursework. The second step is to design a secure database, which the application needs in order to store student marks; protecting those marks is the ethical issue in this project. After that, the system should mark submissions automatically and provide feedback to both the student and the lecturer. These tasks need some tools: one to check the program code for plagiarism, and one to run the programs automatically and produce their output. The literature review therefore concentrates on tools that detect plagiarism in program code.


The main idea of the project is to improve the existing system with a website through which students can submit their coursework easily. In the existing system, students either attach their coursework through a link and press a submit button, or send it by email. Some lecturers still mark paper copies of coursework, so students have to hand in their coursework in a box. The box collects a lot of paper, which makes it hard for the officer to organise. With many students taking the same course and lecturers teaching many courses, marking student coursework takes a great deal of time and effort. The main focus of this report is related work on tools that detect plagiarism.

The aims and description of this project are:

Aims - plagiarism is not checked as soon as the program is submitted but should be something that the lecturer can choose to do later.
  • Provide a website and system to collect and manage the submission of students' Java coursework programs. The website should do automatically, by means of tools, what a human marker would do, and give a result for each submitted program. Once program code is submitted, the website should first check it for plagiarism using one of the tools described in more detail later. The programs are then run automatically by a tool called Ant to obtain their output. For debugging, another tool is used in this project: JUnit, a unit-testing framework. When all of this is done, the system gives feedback and marks to the lecturer and the student by email. The lecturer can therefore quickly and easily print the program or other documents and see whether plagiarism exists.
The objectives of this project are as follows:
  • Understanding and analysing the existing system.
  • Asking or working with lecturers for some time to collect good ideas about how they structure their marking of student coursework.
  • Background / literature review of other projects that use plagiarism tools, in order to decide which one will be used in the end, as well as how the Ant tool works.
  • The requirement is an automatic system to debug student coursework.
  • The design for this system will cover both a useful website with good guidance and a database, which is very important for keeping student marks safe. Usability, achieved by interacting with users to arrive at a good design, is therefore important.

Background / literature review:


This section introduces what is needed for the whole dissertation: tools that detect plagiarism in program code. These tools are divided by the way they work into several categories, which are covered later.



Plagiarism is presenting work or code as one's own when it was written by somebody else. It is not permitted to take someone's work or ideas and present them as your own without a reference. For this reason there has been a lot of study and discussion in this area to define plagiarism and to develop tools that can detect it in code.

Methods for detecting plagiarism [11][10]:

a) Attribute counting systems:

As the name suggests, this kind of metric simply counts attributes. It considers the operators and operands of all types in a program, using Halstead's software metrics. The main aim of the system is to find a similarity value for two programs. It uses the following quantities:

n1 is the number of distinct operators.

n2 is the number of distinct operands.

The capital letters denote the corresponding totals over all types: N1 is the total number of operator occurrences and N2 is the total number of operand occurrences.

This system uses two equations to calculate the metrics (the logarithm is base 2 in Halstead's definitions):

1- V = (N1 + N2) log2(n1 + n2)

2- E = [n1 N2 (N1 + N2) log2(n1 + n2)] / (2 n2)
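The two equations can be checked with a few lines of code. The following is a minimal sketch in Python, assuming the logarithm is base 2 and that n2 is non-zero; the function name `halstead_volume_effort` is chosen here for illustration and does not come from any of the tools discussed.

```python
import math


def halstead_volume_effort(n1, n2, N1, N2):
    """Return (V, E) for the four Halstead counts.

    n1, n2: distinct operators and distinct operands.
    N1, N2: total operator and operand occurrences.
    Assumes n2 > 0 and a base-2 logarithm.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)    # V = (N1 + N2) log2(n1 + n2)
    effort = (n1 * N2 * volume) / (2 * n2)     # E = n1 N2 V / (2 n2)
    return volume, effort


# A small program with 4 distinct operators, 4 distinct operands,
# 10 operator occurrences and 6 operand occurrences.
v, e = halstead_volume_effort(4, 4, 10, 6)
print(v, e)
```

Two programs with very different values of V and E are unlikely to be copies of each other, although similar values alone do not prove copying.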

The system goes through every line of the program, checking whether each line is blank or a comment.

b) Structure metrics:

This method pays more attention to the structure of the program than the previous one, which focuses on counts of attributes. The program is split into tokens, which help to detect plagiarism.

Structure metric systems work in two phases.

The first phase is called tokenization:

The first phase is similar in all tools that use this approach.

In this phase, every program is divided into tokens.

This phase consists of the following steps:

  1. Comments and string-constants are removed.
  2. Upper-case letters are translated to lower-case.
  3. A range of synonyms are mapped to a common form.
  4. If possible, the functions/procedures are expanded in calling order.
  5. All tokens that are not in the lexicon for the language are removed. [6]
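The tokenization steps above can be sketched as follows. This is only an illustration in Python for C-style or Java-style source text: the synonym table `SYNONYMS` and the lexicon `LEXICON` are tiny hypothetical examples, not the tables used by any real tool, and step 4 (expanding functions in calling order) is left out because it needs a parser.

```python
import re

# Hypothetical synonym table and lexicon, for illustration only.
SYNONYMS = {"while": "loop", "for": "loop", "do": "loop"}
LEXICON = {"loop", "if", "else", "return",
           "=", "+", "-", "*", "/", "<", ">",
           "(", ")", "{", "}", ";"}

# One pass over string constants, line comments and block comments, so that
# "//" inside a string is not mistaken for the start of a comment.
STRINGS_AND_COMMENTS = re.compile(
    r'"(?:\\.|[^"\\])*"|//[^\n]*|/\*.*?\*/', re.DOTALL)


def tokenize(source):
    # Step 1: remove comments and string constants.
    source = STRINGS_AND_COMMENTS.sub(" ", source)
    # Step 2: translate upper-case letters to lower-case.
    source = source.lower()
    # Split into identifiers/keywords and single-character symbols.
    raw = re.findall(r"[a-z_][a-z0-9_]*|\S", source)
    # Step 3: map synonyms to a common form.
    mapped = [SYNONYMS.get(tok, tok) for tok in raw]
    # Step 5: drop every token that is not in the lexicon.
    return [tok for tok in mapped if tok in LEXICON]


original = 'While (x < 10) { x = x + 1; } // count\nprintf("done");'
disguised = 'WHILE (a < 5) { a = a + 2; } /* c */ print("x");'
print(tokenize(original) == tokenize(disguised))
```

Because identifiers, constants, comments and letter case are discarded, the two snippets produce the same token sequence even though their text differs.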

The tools differ in the second phase.

The second phase is called token comparison:

In this phase, the token sequences are compared to detect plagiarism.

Many other plagiarism detection tools depend on the program's structure rather than on summary indicators.

The structure metric method is therefore generally considered more trustworthy than attribute counting, because it concerns the structure of the program rather than the values of a few numbers. Examples of tools that use structure are MOSS, YAP and JPlag, which are described in more detail below.

There are many tools that detect plagiarism in code. However, these tools work with different techniques, which have improved from the past to the present.

The first attempts at plagiarism detection generally focused on feature comparison, because the majority of software systems in those days computed a number of different software metrics [1].

The first system was developed in 1976 by Ottenstein. It was concerned only with the FORTRAN language [2] and used Halstead metrics. It considers the numbers of distinct operators (n1) and distinct operands (n2), as well as the total numbers of operators (N1) and operands (N2) mentioned before.

From these, if the four values are similar for two programs, the programs are suspected of plagiarism [2].

The next plagiarism detection system used a larger number of metrics, up to 24, rather than the four numbers used in the first system. It was used to detect plagiarism in Pascal programs, and the larger set of metrics improved performance [2, 3].
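The comparison of the four numbers can be sketched as below. This is an illustrative Python sketch, not Ottenstein's program: it assumes the operators and operands of each program have already been extracted into lists, and it uses exact equality of the four counts, whereas a practical tool would probably allow some tolerance.

```python
def halstead_counts(operators, operands):
    """Return (n1, n2, N1, N2) from lists of operator and operand occurrences."""
    return (len(set(operators)), len(set(operands)),
            len(operators), len(operands))


def same_profile(program_a, program_b):
    """True when both programs have identical (n1, n2, N1, N2) counts."""
    return halstead_counts(*program_a) == halstead_counts(*program_b)


# x = x + 1; y = x * 2
a = (["=", "+", "=", "*"], ["x", "x", "1", "y", "x", "2"])
# The same code with the variables renamed: a = a + 1; b = a * 2
b = (["=", "+", "=", "*"], ["a", "a", "1", "b", "a", "2"])
# A different program: x = x + 1
c = (["=", "+"], ["x", "x", "1"])

print(same_profile(a, b), same_profile(a, c))
```

Renaming variables leaves all four counts unchanged, which is why this simple check can still flag the renamed copy.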

What is Moss?

[7][4] Moss stands for Measure Of Software Similarity. It is an automatic system that determines the similarity of programs. It was developed in 1994 and since then it has mostly been used to detect plagiarism in source code. MOSS supports a number of languages, for instance Java, Ada, ML, C, C++ and Scheme, and it has been very successful with these languages. However, Moss is provided only as an internet service.

The internet service is very simple to use: you list all the files that you want to compare and Moss does the rest. The latest version of the Moss submission script supports Linux. When Moss has finished comparing the files, it provides the output as an HTML page listing all the pairs of programs in which similar code was detected. Moss also makes the files easy to compare by highlighting the matching passages in both files [7]. It is therefore easy to use, and it is free, so anyone can obtain an account.

YAP: by Michael Wise

YAP stands for Yet Another Plague. The system exists in three versions. It is Michael Wise's tool, developed at the University of Sydney, Australia, and he defines his own structure metrics.

YAP1 and YAP2 are the earlier versions. Wise developed YAP1 first, followed by YAP2 and finally YAP3.

YAP1 [6]: this version was published in 1992. It works with a mix of UNIX utilities, tied together with a Bourne shell script.

Its disadvantage is that it is slow.

YAP2 [6]: this version was an improvement on the previous one and was much faster. It was written in Perl, used a C program, and implemented Heckel's algorithm.

YAP3: this version was an improvement made in 1996 to detect plagiarism, that is, another person's work presented as one's own, in computer programs as well as in other texts submitted by students [1].

This was the final version of Wise's tool. In this latest version he describes the second phase as novel [1]. The second phase depends on an algorithm named "Running Karp-Rabin Greedy String Tiling (RKR-GST)", which is described in more detail later under JPlag.

With this algorithm, this version is much better at locating similar lines of code that have been transposed.

YAP3 is still weak when the order of the source code is changed [6].

We now move on to JPlag.

JPlag is a program developed by Lutz Prechelt, Michael Philippsen and Guido Malpohl in 1996. They set out to provide a system that takes a set of source programs and finds the similarities among them. It does not merely compare bytes of text; it is also aware of program structure and of programming language syntax. The languages that JPlag currently supports are Java, C, C++ and Scheme, as well as natural language text. It is used to detect similar copies among students' exercise programs rather than to look for copies of programs from the internet. In addition, it has a good graphical user interface for showing the results. The interface is an HTML presentation of the results, with a separate window that highlights the parts of both files that are the same.

The algorithms that JPlag uses:

JPlag uses two algorithms.

  1. The Greedy String Tiling algorithm (GST).
  2. The Rabin Karp (RK).

Greedy String Tiling is the algorithm mostly used. However, it is expensive to run: its worst-case complexity is O(n^3) in big-O notation, and this worst case cannot be reduced. Both Greedy String Tiling (GST) and Rabin-Karp (RK) work on program code that has been converted into tokens. In GST the tokens are compared immediately, because the comparison is one to one. GST also finds two similar pieces of text or pattern even if they are in different positions; figure (3) gives an idea of the code of the GST algorithm, and points 11-15 show how the algorithm works and how it collects the set of matches. Here the 'text' is the longer string and the 'pattern' is the shorter one. In RK, by contrast, a hash value is calculated for the tokens; the hash is central to RK, because the hash values are what is compared in the end. RK is the best method to apply when the strings are long, which is where hashing brings the greatest benefit [12].
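The basic idea of Greedy String Tiling can be sketched in Python as follows. This is a simplified illustration of the plain algorithm (without the Karp-Rabin speed-up), not the code used by JPlag or YAP3; the names `greedy_string_tiling` and `min_match` are chosen here for illustration.

```python
def greedy_string_tiling(pattern, text, min_match=2):
    """Return tiles (pattern_start, text_start, length) covering common runs.

    In each round the longest unmarked common runs are found and marked,
    so every token ends up in at most one tile.
    """
    marked_p = [False] * len(pattern)
    marked_t = [False] * len(text)
    tiles = []
    while True:
        max_match = min_match
        matches = []
        for p in range(len(pattern)):
            for t in range(len(text)):
                j = 0
                # Extend the match while tokens are equal and unmarked.
                while (p + j < len(pattern) and t + j < len(text)
                       and pattern[p + j] == text[t + j]
                       and not marked_p[p + j] and not marked_t[t + j]):
                    j += 1
                if j == max_match:
                    matches.append((p, t, j))
                elif j > max_match:
                    matches = [(p, t, j)]   # a longer match replaces the rest
                    max_match = j
        if not matches:
            break
        for p, t, j in matches:
            # Skip a match that overlaps a tile marked earlier in this round.
            if any(marked_p[p:p + j]) or any(marked_t[t:t + j]):
                continue
            for k in range(j):
                marked_p[p + k] = True
                marked_t[t + k] = True
            tiles.append((p, t, j))
    return tiles


# The two halves of the sequence are swapped, but both are still found.
tiles = greedy_string_tiling(list("abcdexyz"), list("xyzabcde"))
print(tiles)
```

The three nested loops are the source of the cubic worst case mentioned above. Because matches are searched for at every pair of positions, blocks of code that have merely been moved are still detected.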

RK's technique for detecting plagiarism is hash value calculation. It first calculates the hash value of the substring Pp of the pattern and then the hash value of Pp+1. In other words, the hash values of the pattern substrings are calculated first, followed by those of the text substrings of the same size. These values are kept in a hash table, which makes it easy to compare them. If two hash values are the same, the substrings themselves are compared; otherwise no further comparison is needed, so the cost is linear in the sizes of the pattern and text strings. Two methods are used to carry out this work:

  1. ScanPattern().
  2. MarkArrays().

Plague [1]
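The hashing idea can be illustrated with a short Python sketch. It is not the ScanPattern or MarkArrays code of any tool; the names `window_hashes` and `find_common_substrings`, the base 257 and the modulus are assumptions made for this example, and characters stand in for tokens.

```python
def window_hashes(seq, length, base=257, mod=1_000_000_007):
    """Return the hash of every window of `length` symbols in `seq`.

    Each hash is derived from the previous one in constant time
    (a rolling hash), instead of re-reading the whole window.
    """
    if length <= 0 or length > len(seq):
        return []
    codes = [ord(c) for c in seq]
    high = pow(base, length - 1, mod)   # weight of the symbol leaving the window
    h = 0
    for c in codes[:length]:
        h = (h * base + c) % mod
    hashes = [h]
    for i in range(length, len(codes)):
        h = ((h - codes[i - length] * high) * base + codes[i]) % mod
        hashes.append(h)
    return hashes


def find_common_substrings(pattern, text, length):
    """Return (pattern_pos, text_pos) pairs of equal substrings of `length`."""
    table = {}
    for p, h in enumerate(window_hashes(pattern, length)):
        table.setdefault(h, []).append(p)      # hash table of pattern windows
    found = []
    for t, h in enumerate(window_hashes(text, length)):
        for p in table.get(h, []):
            # Equal hashes can collide, so the substrings are compared too.
            if pattern[p:p + length] == text[t:t + length]:
                found.append((p, t))
    return found


print(find_common_substrings("abcde", "xxbcdyy", 3))
```

Only positions whose hash values agree are compared symbol by symbol, which is what keeps the expected cost close to linear in the sizes of the pattern and the text.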

Plague belongs to the same structure metric family, and its technique and method are similar to those of YAP3. However, Plague does not use the RKR-GST algorithm that YAP3 uses. It works in three phases:

  1. Create a sequence of tokens and a list of structure metrics to form a structure profile. The profile summarizes the control structures used in the program and represents iteration / selection and statement blocks.
  2. An O(n^2) phase compares the structure profiles and determines pairs of nearest neighbors.
  3. Finally, a comparison of the token sequences using a variant of the Longest Common Subsequence is made for similarity [1].
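The third phase can be illustrated with the standard Longest Common Subsequence computation. The sketch below, in Python, implements the plain textbook algorithm, not Plague's own variant, which is not described in the sources used here; the function name `lcs_length` is an assumption for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Classic dynamic programming, keeping only the previous table row.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


tokens_a = ["loop", "(", ")", "{", "=", "+", ";", "}"]
tokens_b = ["loop", "(", ")", "{", "if", "=", ";", "}"]
print(lcs_length(tokens_a, tokens_b))
```

A long common subsequence relative to the lengths of the two token sequences suggests that one program may have been derived from the other by inserting or deleting statements.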

In summary, as Clough goes on to note, Plague suffers from a number of problems, including:

  1. The fact that it is hard to adapt to new languages.
  2. Because of the way it produces its output (two lists of indices that require interpretation), the results are not obvious.


Sherlock

Sherlock was developed in 1994 at the University of Warwick. It is a stand-alone application and is easy to use, for example online through the BOSS submission system.

The system can compare both source code and text, and it presents the results as an HTML page with a graph. The algorithm that Sherlock uses is the same as in YAP3. It also uses tokens, searching line by line for sequences that are the same in both files; such sequences are called runs. Sherlock's technique is to look for runs of the greatest length. Unlike Moss and JPlag, Sherlock does not have a web-based service, so it is a stand-alone tool.


What does usability mean?

[13][14] As the name suggests, usability means making a system or product as easy as possible for the user to use. It is also used in evaluation to mean "easy to use". Usability can also be understood as the designer's task of simplifying problems for the user: the user is important in keeping the system simple, so programmers have to interact with users throughout. In another sense, usability is a central concern of human-computer interaction (HCI), which aims to make systems clear to their users. Usability has also been described as the capacity of a system to be used by humans. The majority of researchers agree that usability is context dependent, shaped by the interaction among users and the problems they are solving.

In our work we consider the usability of web site design, which is nowadays increasingly important. To guarantee that a web site is of good quality, it should have well-recognised strengths. People normally leave a web site that is hard to navigate or has dead links. In addition, some web sites put so much information on the home page that users find it difficult to locate the right link. Designers therefore pay attention to usability as a quality factor in their designs. Usability was first applied to paper documents; however, it quickly moved to hypertext.

The ethical and social issues for the project:

The most important issue in this project is the students' marks. After submission, the system marks the coursework automatically and produces the result. These marks matter to the lecturer, and only the lecturer should have access to the database to see or change the information. How, then, can the marks be kept safe?

A well-designed database is needed in order to keep all the students' marks safe.

This is the concern of database security. The increasing use of computer programs in everyday life is leading programmers and designers to pay more attention to the database, which is made up of a group of related information.

There are many kinds of database, each with strengths and weaknesses. Some types of database are:

  1. Microsoft Office Access.
  2. Microsoft SQL Server.


  1. P. Clough, "Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies", internal report, Department of Computer Science, University of Sheffield, 2000.
  2. S. Grier, "A tool that detects plagiarism in Pascal programs", ACM SIGCSE Bulletin, Vol. 13, No. 1, 1981, pp. 15-20.
  3. J.L. Donaldson, A.-M. Lancaster, and P.H. Sposato, "A plagiarism detection system", ACM SIGCSE Bulletin.
  4. A. Aiken, "Measure of software similarity", URL
  5. M.J. Wise, "YAP3: improved detection of similarities in computer programs and other texts", presented at SIGCSE'96, Philadelphia, USA, February 15-17, 1996, pp. 130-134.
  12. M.J. Wise, "Running Karp-Rabin Matching and Greedy String Tiling", University of Sydney, 1993.
  13. "Using the Metro Web Tool to Improve Usability Quality of Web Sites".
