Comparable outcomes, grade descriptors and exam grading

by therealityofschool

ASCL has criticised Ofqual’s policy of requiring exam boards to use comparable outcomes as a way of determining the rationing of exam grades – how many A*s, As, Bs and so on.

Once all the scripts have been marked and moderated they have to be given a grade. The grading is done by the exam boards – often called Awarding Organisations or AOs. The AOs are supervised by Ofqual (the Office of Qualifications and Examinations Regulation, based in Coventry). One important objective of Ofqual is to maintain grade standards over time, ensuring that the scripts being graded this year would have got the same grade had they been set, marked and graded last year. It is not simply a matter of saying ‘over 80% is an A grade’, because this year’s papers may be slightly harder than last year, in which case the 80% rule would be unfair on this year’s candidates. So the examiners should take a close look at the scripts from last year which were just below and just above each grade boundary and compare them with this year’s scripts.

Nor is it simply a matter of saying ‘x% of candidates got an A grade last year so we should give the same % an A this year’. After all, the candidates’ work this year might be better – over time, teachers become more skilled at teaching a given syllabus, or schools might start diverting weaker students to other subjects where they have a greater chance of success. Or it might be that candidates get worse, because schools start putting much younger children in for an exam (as some schools did with GCSE in recent years) or because the best students have moved away to other subjects or qualifications. We know that some bright children find GCSEs rather undemanding, which is why independent schools have put them in for a slightly harder alternative, the iGCSE, instead – so some of the best students have been removed from the group taking GCSEs.

Where a new syllabus has been introduced, it is quite possible that it will be slightly easier or slightly harder than the previous syllabus. But because the new syllabus is different from the previous version, it is not as easy to simply compare this year’s scripts with last year’s. Furthermore, experience tells us that results often dip for a year or two when a new syllabus comes in because teachers are less used to it. This is potentially unfair on these candidates, so when a new syllabus starts Ofqual asks exam boards to operate on the ASSUMPTION that the grade distribution this year will be very similar to last year if the cohort is similar.

The way they check whether the cohort is similar is by comparing its Key Stage 2 test scores in maths, English and science. This is called the ‘comparable outcomes’ approach and is meant to achieve fairness to candidates from one year to the next. However, it is not a firm requirement – if an exam board can prove to Ofqual that standards really have risen this year, then it can give higher grades. And of course it only applies to the grade distribution for the cohort as a whole, not to individual candidates! A child who did well in KS2 but was lazy for the next five years would still do badly at GCSE.

Of course the use of KS2 data could be criticised as the basis for this comparison – it is five years old and most pupils in independent schools do not take KS2 tests so the data for them is missing. Nevertheless, we need some basis for determining the likely quality of a given cohort and the KS2 test results are the best we have got at the moment.

The first sitting of the reformed GCSEs will be in summer 2017. From this year onwards Ofqual will underpin comparable outcomes with national reference tests – tests taken in English and Maths by a large sample of children shortly before they sit the GCSE itself. Marks in the national reference test will be aligned to the new GCSE grade boundaries. The same test will be used every year, so it will be possible to see whether standards in Maths and English have really risen or not.

It is important to understand that the statistical framework within which grades are awarded is unlikely to have much bearing on the results of any one pupil or any one subject in one school. 700,000 children sit GCSEs every year. In 2014 almost 6 million subject entries were made, generating some 11 million exam scripts. Human judgement is central to the grade achieved by every child, but with such huge numbers a statistical framework which acts to ensure fairness from one year to the next is needed.

Ofqual also tries to align grade standards across awarding organisations (it needs to ensure that a grade A from this exam board means the same thing as a grade A from every other exam board), and to ensure that this is also true across all the different syllabuses an exam board may offer for any given subject. This is a difficult task! Ofqual does its best, sampling as many exam boards, subjects and scripts as it can – but it has to do this quickly, between the completion of the marking/grading and the deadline for publishing the results, so it cannot consider every subject offered by every board every year.

Ofqual also tries to ensure that the grade distribution for different subjects is fair. Only more able pupils take Ancient Greek GCSE, so it is right to expect more high grades in Greek than in, say, Art. The best scientists take separate Physics, Chemistry and Biology GCSEs, so they should get more high grades than those taking Combined Science. Of course Ofqual and the exam boards can use the Key Stage 2 data to test these assumptions about the ability of the cohort. But as with everything in exams, the decisions are not straightforward. For example, it is in reality impossible to compare the quality and level of difficulty of a painting with a piece of Greek translation – they are not comparable.

So the allocation of grades is based on a combination of Ofqual’s statistical recommendations AND examiner judgement.

How does the examiner judgement bit work?

GCSEs have “grade descriptors” for each subject. These give “a general indication of the standards of achievement likely to have been shown by pupils awarded a particular grade.” A student may not have shown one aspect of the grade description but may still be worthy of the grade if they showed a better-than-described performance in some other aspect. Here is part of the grade descriptors for Geography GCSE:

Grade A scripts:
They apply appropriate knowledge and understanding of a wide range
of geographical concepts, processes and patterns in a variety of both
familiar and unfamiliar physical and human contexts. They recognise and
understand complex relationships between people and the environment,
identifying and evaluating current problems and issues, and making
perceptive and informed geographical decisions. They understand how
these can contribute to a sustainable future.

Grade C scripts:
They apply their knowledge and understanding of geographical concepts,
processes and patterns in a variety of both familiar and unfamiliar physical
and human contexts. They understand relationships between people
and the environment, identifying and explaining different problems and
issues and making geographical decisions that are supported by reasons,
including sustainable approaches

If you can imagine applying these to exam scripts, you will see the difficulty with grade descriptors!

Some grade boundaries are defined by EXAMINERS’ JUDGEMENT. To identify the boundary mark for each of these JUDGEMENT GRADES the procedure followed by the exam boards is as follows:

First, taking a sample of scripts whose marks are believed to be close to, say, the A/B grade boundary, a group of experienced examiners look at the scripts starting with those with the highest mark and working down. They try to agree on the lowest mark for which there is consensus that the quality of work is worthy of an A rather than a B. This is called the upper limiting mark.

Next, taking scripts with marks a little below the upper limiting mark and working up from the bottom, they identify the highest mark for which there is consensus that the quality of work is not worthy of the higher grade (an A). The mark above this is the lower limiting mark.

The chair of examiners must then weigh all the available evidence – quantitative and qualitative – and recommend a single mark for the A/B grade boundary, normally within the range bounded by the two limiting marks.
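To make the two limiting marks concrete, here is a minimal sketch. The encoding of examiner consensus as 'higher', 'lower' or None, and the sample marks, are my own invention for illustration – the real process is a discussion among examiners, not a data structure.

```python
def limiting_marks(consensus):
    """Identify the limiting marks for a judgement grade boundary.

    consensus maps each sampled mark to 'higher' (agreed worthy of
    the higher grade), 'lower' (agreed not worthy), or None (no
    consensus). This encoding is hypothetical.
    """
    # Upper limiting mark: working down from the top, the lowest
    # mark for which there is still consensus that the work is
    # worthy of the higher grade.
    upper = min(m for m, v in consensus.items() if v == 'higher')
    # Lower limiting mark: one above the highest mark for which
    # there is consensus that the work is NOT worthy of it.
    lower = max(m for m, v in consensus.items() if v == 'lower') + 1
    return lower, upper

# A sample of scripts near a notional A/B boundary:
sample = {65: 'higher', 64: 'higher', 63: 'higher',
          62: None, 61: None, 60: 'lower', 59: 'lower'}
print(limiting_marks(sample))  # → (61, 63)
```

The chair of examiners would then recommend a single boundary mark, normally somewhere within that range.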

The chair of examiners makes the recommendation to the officer of the exam board with overall responsibility for the standard of qualifications. That officer may accept or vary the recommendation and will subsequently make a final recommendation to Ofqual. Ofqual may approve it or give reasons why they are not happy with it. In the latter case the exam board must reconsider and produce a final report. Ultimately Ofqual can direct an exam board to prevent it setting what it believes to be unjustifiably high or low grade boundaries.

Stage 1
The A/B grade boundary, the C/D boundary and the F/G boundary are set by examiners using two types of information:
* statistical evidence, such as the proportion of candidates who might be expected to get an A grade or better based on the proportion who did so last year, and whether the Key Stage 2 test results for that cohort suggest they are brighter or less bright than the previous year’s cohort;
* examiners’ judgement based on the quality of the scripts in front of them, using grade descriptors published by the exam boards and a knowledge of what a grade A script (for example) looked like last year.

Stage 2
The top mark for a B and the bottom mark for a C have now been defined. The B/C boundary is simply set as the middle mark between these two.
The top mark for a D and the bottom mark for an F were defined in stage 1. The D/E and E/F boundaries are set by simply dividing the range between these two marks into three.

Stage 3
The A*/A boundary is set as many marks above the A/B boundary as the B/C boundary is below it.
The G/U boundary is set as many marks below the F/G boundary as the E/F boundary is above it.

In other words, some of the grading is done arithmetically, not on the basis of examiner judgement.
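The arithmetic in stages 2 and 3 can be sketched in a few lines of code. The function name and the mark values below are invented for illustration; each boundary is taken to be the lowest mark earning the higher grade, as in the stages above.

```python
def derive_boundaries(ab, cd, fg):
    """Derive the arithmetic grade boundaries from the three
    judgement boundaries (A/B, C/D, F/G). Each value returned is
    the lowest mark earning that grade. Hypothetical sketch."""
    # Stage 2: B/C is the middle mark between the top B mark
    # (one below the A/B boundary) and the bottom C mark (C/D).
    bc = round((ab - 1 + cd) / 2)
    # Stage 2: D/E and E/F split the range from the top D mark
    # (one below C/D) down to the bottom F mark (F/G) into three.
    third = (cd - 1 - fg) / 3
    ef = round(fg + third)
    de = round(fg + 2 * third)
    # Stage 3: A*/A is as far above A/B as B/C is below it;
    # G/U is as far below F/G as E/F is above it.
    a_star = ab + (ab - bc)
    gu = fg - (ef - fg)
    return {'A*': a_star, 'A': ab, 'B': bc, 'C': cd,
            'D': de, 'E': ef, 'F': fg, 'G': gu}

print(derive_boundaries(ab=71, cd=50, fg=20))
# → {'A*': 82, 'A': 71, 'B': 60, 'C': 50, 'D': 39, 'E': 30, 'F': 20, 'G': 10}
```

With these invented marks, three judgement boundaries generate the other five purely by arithmetic.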

Ofqual is particularly concerned to stop an exam board being too generous with grading, because that would be unfair on students taking other exam boards and on candidates from last year who might not have been treated so generously. Furthermore, if exam boards are a bit generous every year, then grades gradually improve over time for no very good reason – grade inflation. Grade inflation is bad because it tends to make people cynical about the value of exam grades and, if too many students have top grades, it becomes harder for universities and employers to select students on the basis of exam results. ‘All must have prizes’ may seem attractive to some, but it is unfair on the hardest-working and most able students, whose exam results come to mean less and less.