Exams in English schools

by therealityofschool

 

Exams are an essential element of a child’s education.  One reason is the tremendous value of committing knowledge and information to the long-term memory.  For most children, carrying what they have learnt in school into adult life depends in large measure on their being forced to memorise it.  A typical average-ability 16-year-old boy can reel off 200 or so words in French three months before he sits the GCSE.  On the day of the exam that figure has grown to 1,000 or more – all driven by fear of the exam.

Exams put pressure on children, and that is their great virtue.  Girls are more likely to want to please the teacher and are therefore more motivated during the course.  Boys do not especially want to please teachers – in my experience of teaching boys, 80% are relatively idle during the term but most make a big effort preparing for exams.

Exams are the essential building block of motivation.  Ask any teacher who has had to teach an unexamined course to 15-year-olds, as many schools used to do with Religious Studies.  It was a thankless task, and almost all such schools now insist pupils take the RS GCSE as a way of improving pupils’ attitude in lessons.  Anyone who thinks that exams are a bad thing has never taught a class of teenage boys.  Exams work because they make pupils work.

The age at which pupils are required to be in education or training in England has risen to 18 so why do we need exams at all at age 16?  Because in the English system we typically drop down from ten GCSE subjects to three or four A-levels at that age.  On average one of those A-levels is a subject not done at GCSE, so most pupils drop about seven subjects at the age of 16.  It is vital that, having studied these seven subjects for up to twelve years, pupils be examined in all of them in order to consolidate what they know and measure their progress.

I have quite often heard people who should know better say ‘other countries don’t have GCSEs’.  They are wrong.  If you look at the highest performing countries there is a range of models.

For example, in the lower secondary years in Japan students study Japanese, Mathematics, Social Studies, Science, Fine Arts, Foreign Languages, Health and Physical Education, Moral Education, Industrial Arts and Homemaking. In the third year of lower secondary (age 15) they take a national examination in Mathematics and Japanese. Their schoolwork across other subjects is formally assessed by teachers in order to be awarded the Lower Secondary School Leaving Certificate. This is their equivalent of GCSEs.

Singaporean students typically take between six and ten subject examinations at either O Level (after five years of secondary schooling) or N Level (after four years).

Exam results are the necessary qualification for moving to the next level.  We do not want pupils embarking on A-levels unless they have a GCSE performance which suggests they might achieve something worthwhile.  We do not want students embarking on a medical degree if they cannot get an A grade in Chemistry – they would be too likely to fail.

The alternative to exams is continuous teacher assessment.  In England in recent years we experimented with teacher assessment and it was disastrous.  Many teachers hated it because they came under huge pressure to get good marks for all pupils (where do you think grade inflation came from?) and because these ‘controlled assessments’ were intensely dull.  The academic year became dominated by dreary teacher-assessed coursework.

Pupils in successful countries take exams.  They force children to place the knowledge they have been presented with into the memory.  Once in the memory new things start to happen in the brain – like analytical thinking and the creation of links between different bits of knowledge.  Educated people know things and the reason they know things is not simply because ‘they have been taught it’.  Far too many children are taught things but know nothing.  The essential step in the process is commitment to memory.

Of course exams cause anxiety and distress but those who think children should never be challenged in this way are the enemies of good education.  Teenagers, and especially boys, have to be driven to succeed.  Exams are that driver.

 

Exam Boards

In England we have an unusual state of affairs whereby more than one exam board offers GCSE and A-level exams in each subject. Ministers have been worried about this because the various exam boards compete with each other by offering easier syllabuses, easier questions and more generous grading.   Ofqual has to try to stop this happening.

 

Why have we got multiple exam boards?  They arose because in the nineteenth century schools and universities decided that they wanted external exams.  The government wasn’t offering to set them up so the universities stepped in as follows:

1857 Oxford Board

1858 Cambridge Board

1858 Durham Board

1902 London Board

1903 Manchester, Leeds and Liverpool combined to run the Joint Matriculation Board.

 

After 1987 exam boards merged and there was a gradual disengagement of the universities; Edexcel, for example, was taken over by Pearson from London University in 2003.  Today there are three main exam boards in England.

 

A future government could decide to run all exams itself or might demand that each subject qualification is run by just one exam board.  This is opposed by those who fear the loss of competition might drive out innovation.  There would be concerns about politically motivated fiddling with exams by governments and anyway governments themselves may well balk at the idea of running something as complex and politically sensitive as an exam system.

 

Grade inflation

Between 1990 and 2012 there was grade inflation at both GCSE and A-level. The proportion gaining A*/A at GCSE rose from 11% in 1990 to 21% in 2015.  The equivalent figures for A-level were 11% and 26%.

This was caused by a number of things.  There were several exam boards in competition with each other for customers and, in what was later described as a race to the bottom, they competed by offering ever-easier exams.  The exams regulator foolishly introduced a system of modules whereby the exam was broken up into parts and each part could be taken at any point and resat as many times as was needed.  So whereas the A-level had once been, say, three long papers sat at the end of two years, it was now six papers, three of which would probably be sat in Year 12 and resat once or twice in Year 13.

Teacher-assessed coursework was another reason for grade inflation.  By 2012 many GCSEs were available where 60% of the marks were for coursework.  Coursework is much easier to manage than examinations – the teacher is always able to help candidates gain better marks.  A judicial enquiry into the 2012 English GCSE results revealed that a huge proportion of candidates who did badly in the written exam did well in coursework and many did just well enough in the coursework to tip them into a C grade.

And finally, as exams mature they become better resourced and teachers become better at teaching the course.  Nevertheless, there was an element of self-deception about the rise in grades during the Blair-Brown years as the Government talked about the improved grades as evidence of rising standards.  But standards were not rising.

 

Standards in English schools

Between 1918 and 1951 pupils took a School Certificate overseen by the Secondary Schools Examinations Council.  The examination was usually taken at age 16, with performance in each subject graded as Fail, Pass, Credit or Distinction. Students were required to gain six passes, including English and mathematics, in order to obtain the certificate.

Some students who passed stayed on at school to take the Higher School Certificate at age 18. In 1951 these Schools Certificates were replaced by O-levels and A-levels.

Before 1965 only the top 20% of the ability range (those in grammar schools and independent schools) took O-levels and went on to A-levels, the rest (those in secondary modern schools) leaving school without qualifications.  In 1965 the more vocational Certificate of Secondary Education was introduced for 16-year-olds; if the O-level was for the top 20-30% of the ability range, CSEs were for the next 40-50%.

 

With the advent of all-ability comprehensive schools in the 1970s it became clear that a system which required schools to divide the population into two – O-level or CSE – was unsatisfactory.  40% of those taking O-levels failed and many of those taking CSE could in fact have managed O-levels – so students in both groups were being misclassified by their schools.  In 1986 O-levels and CSEs were merged to create the GCSE – the General Certificate of Secondary Education.

 

The GCSE is designed to be accessible to the bulk of the school population.  For this reason some of the questions have to be easy so that the least able can gain some marks and thus a grade.  It is not fair to compare the easiest questions in a GCSE with an O-level – they are designed for different groups.  However it IS fair to compare the hardest questions at both GCSE and A-level with those set in the past, and this can be done by looking here: 1974 Exam Papers – and then looking at recent past papers on exam board websites such as http://www.aqa.org.uk/exams-administration/exams-guidance/find-past-papers-and-mark-schemes.

 

Such a comparison shows that some subjects have been dumbed down at both GCSE and A-level, most obviously modern foreign languages and sciences.  The hardest questions in a subject like history are not very different.  Indeed, when the first GCSE results came out in 1988 more academically selective schools saw their results shoot up – the GCSE was much easier.

 

OECD research published in 2013 (the Survey of Adult Skills) found that England was the only country in the developed world where the generation approaching retirement was more literate and numerate than the youngest adults: adults aged 55 to 65 performed better than 16- to 24-year-olds at foundation levels of literacy and numeracy.

 

The main problem with English schools is the long tail of underachievement.  In 2015 54% of pupils gained five GCSEs grade A* to C including English and Maths.  54% is a low figure – after all, a C grade in a GCSE is not a great achievement for many pupils.  For those on free school meals the figure was 33.1%.

 

In 2015 69% of all GCSEs were passed at grade C or above but 37% of those who passed only achieved a grade C – a bare pass. So many pupils are scraping by.  Even these figures flatter the ability of the students because gaming and teaching to the test push many up to a C grade.

 

The reformed GCSE Maths and English are tough but the comparable outcomes approach to grading preserves the numbers who ‘pass’: the % mark required to get a grade 4 (the equivalent to a C in old money) is now quite low.  So the true tail of under-achievement in England is longer than we think.

 

Gaming

Governments require vehicles to pass mandatory exhaust emission tests before they can be sold, but it has been found several times that vehicles which had passed the test before they left the factory gave off much more pollution in normal road use. The Volkswagen emissions scandal (‘emissionsgate’) erupted in 2015 when the United States Environmental Protection Agency found that Volkswagen had intentionally programmed diesel engines to activate certain emissions controls only during laboratory emissions testing.  The manufacturers had designed engines which were polluting but which could nevertheless pass the test.

After about 2008 schools in England started talking about a private company called PiXL (Partners in Excellence) which perfectly legitimately taught strategies to ‘game’ the system.  For example, they told schools to identify all their Year 10 pupils who were likely to gain a C in maths but a D in English.  They then encouraged the schools to cram them for an early sitting in maths.  Once these pupils had passed maths in Year 10 or early in Year 11, the schools then used the Year 11 maths lessons for extra English.  This policy maximised the number of pupils gaining C grades in both Maths and English.  This focus on the C/D grade boundary was itself a result of the government’s own ‘floor standards’ which required schools to get their pupils five GCSEs grade A* to C including English and Maths.

Another example of gaming was the discovery by PiXL that lower ability pupils were more likely to pass English if they took the international GCSE rather than the conventional GCSE.  One reason for this was that the iGCSE retained coursework and marks for a spoken English test.  Following this discovery, huge numbers of less able students were transferred to the iGCSE.

Gaming of the sort sponsored by PiXL gives an illusion of progress.  I would never blame schools for using their methods – I would do the same – but where gaming works it produces an improvement in pupils’ grades without improvement in knowledge or skill.

Another example of gaming has been those schools gaining extra time for students by requesting ‘reasonable adjustments.’  JCQ (the Joint Council for Qualifications) agree the so-called access arrangements.  In the past candidates have gained extra time for minor problems, such as ‘exam phobia’, for which the evidence was limited.  Perfectly able candidates have used word processors in exams because it was ‘their normal way of working.’

A final example of gaming was the European Computer Driving Licence, an IT qualification which the government foolishly agreed should have the same value as a GCSE.  It was relatively easy and could be taught, according to some people, in a weekend.  For this reason it became immensely popular, with all year 11s taking it in some schools.

Gaming is the inevitable result of high stakes accountability – if exam results matter greatly, people will always look for short-cuts to success.  But gaming reduces the validity of a qualification – you may think that a pupil who scrapes a C/4 (pass) in English is quite literate but you would be wrong.

Gaming is an example of Campbell’s Law: the more a social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.    

Teaching to the test

In 2010 I was short of a religious studies GCSE teacher in my school so I put myself up to teach the subject, for the first time.  An analysis of past papers soon revealed that some topics came up more often than others and certain specific questions came up more often than others.  Being short of time and anxious to prove my competence, I was therefore encouraged to teach to the test by focussing on those areas.  My pupils did well enough, but I knew that if they had received a different exam based on other parts of the syllabus, the parts I had de-emphasised, their results would have been different.

Teaching to the test may or may not be gaming, but it is something which produces results which can flatter pupils.  It gives a false impression of their knowledge across the whole syllabus or domain.

Exam reform in England, 2011-18

Ofqual (the Office of Qualifications and Examinations Regulation) was set up in 2010 and Glenys Stacey was chief regulator from 2011 to 2016.

In 2011 Ofqual announced their hostility to modules (independently graded exam papers): it was impossible to grade fairly when there were many routes to one qualification through modules.  In any one year exam boards were being asked to rank students, some of whom had taken all the modules in one sitting, others of whom had spread them out over several years.  Some had taken a module once, others had taken it four times.

Then they announced their concern about coursework.  Some was never moderated (ie checked by an independent person) including the crucial English GCSE speaking and listening module. Teachers admitted to Ofqual that they had been under pressure to influence their pupils’ results.

With exams you normally like to have a range of marks so that not everyone gets the same grade.  But coursework marks were often bunched at the top end of the scale – which meant that the coursework did not contribute to the necessary range at all.

Further analysis by Ofqual revealed that much coursework didn’t measure what it claimed to.  For example, fieldwork in Geography was supposed to measure the ability to collect and analyse data but in fact it measured little more than an ability to follow instructions given by the teacher.

Coursework in GCSE mathematics and science was also felt by most teachers to be of limited value and burdensome to administer.

At the same time employers and universities were complaining about the quality of their 18-year-old employees and undergraduates: their English and maths were poor, they lacked initiative and they appeared to have gained good exam results by spoon-feeding.

The Department for Education shared this concern about low standards, about the way in which pupils were stacking up marks by taking modules every six months over a two-year period and by sitting and then resitting modules, and about the generally low level of some syllabuses.  There was particular concern about A-levels: the content of modules taken in Year 12 was long forgotten by the time the students arrived at university.  The modular system meant that at no point did students know the whole syllabus.

 

So between 2011 and 2015 a number of decisions were taken by the Department for Education and Ofqual that amounted to a radical shake-up of the whole system.

1 They scrapped January exam sittings so halving the number of times a pupil could sit exams.

2 They scrapped modules.  The AS-level exam was decoupled from the A-level so that the A-level was now linear – all A-level papers are sat in one go at the end of the course.

3 They told schools that the first sitting of a GCSE would be the only one which would count for performance table measures.  This discouraged early and multiple sittings of an exam.

4 In English GCSE the speaking and listening would no longer count towards the main grade (but it would be reported as a separate grade).

5 Coursework was scrapped in all public exams unless it measured something important that could not be measured by an exam.  In A-level sciences the only element of practical work now assessed by the teacher is the student’s ability to select the right equipment, use that equipment and log the results.  At GCSE and A-level the results and meaning of the experiments are assessed in the written exam with questions worth 15% of the total marks.

6 All GCSE syllabuses were rewritten by panels of teachers and subject experts under the control of the Department for Education.  The focus was on raising standards to the levels being achieved in the highest-performing jurisdictions, such as Singapore, especially in numeracy.  Mathematics GCSE became significantly harder.  Subjects such as geography, physics, chemistry, biology and design technology contain more maths.

7 In the past several GCSE subjects were tiered – that is, there were easier papers which could only lead to lower grades or harder papers which allowed candidates to access the full range of grades. The problem was that too many pupils were put in for the wrong tier, including fairly able pupils taking the lower tier.  So Ofqual greatly reduced the number of tiered exams.

8 In 2017 England started the process of moving from alphabetical GCSE grades (A* to G) to numerical grades (9 to 1). This was done because the old alphabetical scale had things wrong with it, and moving from one alphabetical system to a different alphabetical system would have been hopelessly confusing. The old alphabetical system was broken for three reasons. First, because of grade inflation too many students were getting A grades. By ‘too many’ we mean so many that the more selective universities were unable to use GCSE grades to distinguish between applicants: all the many applicants they had to choose from had similar grades. Secondly, the introduction of the A* grade was supposed to allow universities to distinguish the outstanding from the very good, but even this grade had suffered from grade inflation. In any case A* is a pretty silly grade.  Finally, the alphabetical grades divided those passing the exam into four: C, B, A and A*. This was not discriminating enough, so the new numerical system divides them into six: 4, 5, 6, 7, 8 and 9.  The alphabetical system divided those failing the exam into four – D, E, F and G – which was too many for relatively few candidates, so this was reduced to three number grades: 3, 2 and 1.

9 All A-level syllabuses were rewritten by panels of university staff so that they were a better preparation for university degree courses.  Universities should no longer be able to complain that students came up to university unprepared.  The sciences introduced more mathematics, the geographers more fieldwork and more physical geography.  The modern linguists produced a syllabus which included more literature and more about the culture of the country whose language was being studied.  In maths the syllabus was arranged so that all students took the same papers rather than choosing from options.

For both A-level and GCSE the reformed specifications were quite detailed.  Ofqual ensures that exam board syllabuses faithfully reflect these specifications – if they do not they are sent back to be revised.

10 In 2016 the Sainsbury Review looked at vocational qualifications for pupils aged 16 and above.  It recommended that the chaos of 20,000 existing qualifications should be reduced to 15 technical education routes, with each qualification run by just one awarding body.

 

How reliable are public exams in England?

Exams sample a pupil’s knowledge and are thus liable to the same weaknesses as any sampling.

Every year many schools experience some exam results which are obviously wrong and have to be adjusted on appeal.

There are enough errors in any year to make us concerned.  In 2015 at GCSE 5.6 million subject entries were sat and 62,000 grades lifted on appeal.  At A/AS level 2.4 million subject entries were sat and 28,600 grades lifted on appeal – a small number in relative terms, a large number in absolute terms.  There is some evidence there are not enough good markers.

JCQ is managing efforts to recruit more good markers.  If schools realise that marking is excellent training for those teaching reformed exams the number of markers may rise.

In 2014 Ofqual researched the quality of external exam marking and found that it was generally good.

They found that most markers were experienced and well trained.  They found that the use of ‘item-level’ marking (each marker marking just one question) increased accuracy.  They found that screen-based marking also increased accuracy because seed answers (standard answers which had been marked by the chief examiner) are injected into the set of answers at regular intervals to check the marker is accurate and consistent.

In 2015 many of the problems identified by schools were not in fact about bad marking, they were about grading.  This was true of the Cambridge Assessment iGCSE 0500/0522 English.  Here there was a sudden increase in the number of candidates, mainly lower ability candidates from state schools.  When the cohort taking an exam changes dramatically it is liable to make grading harder because an exam board can no longer rely on the ‘comparable outcomes’ approach.  Relying on examiner judgement, which is what teachers do every day when marking work, is in fact not as reliable as we would like to think it is.  All the research shows this: different experienced examiners give the same script different marks.  Cambridge Assessment made the problem worse by setting papers which generated a narrow mark range: the gap in marks between each grade was small, so very similar candidates achieved different grades.

In general terms the key to a good exam is the question setting and the mark scheme.  The questions have to be such that students at each grade identify themselves: easy questions for the less able, very hard for the most able.  The mark scheme has to be very clear so that markers know exactly what mark they should be giving.  Poor marking is often the result of a poor mark scheme.

In 2016 Ofqual reformed the remarking/appeals system

Ofqual conducted experiments to test a number of different remarking methods in order to find the most accurate, concluding that the current system of remarking was no better or worse than the alternatives but could be improved further.

They found that remarkers were usually generous so requesting remarks often yields a result.  This is unfair on candidates who do not appeal. From 2016, therefore, exam boards may only raise a mark on appeal if the original mark is not one an experienced examiner could have given – it is an ‘unreasonable mark.’

This brings us to the concept of marking tolerance.  It has long been the case that for longer answers there is an accepted range of possible marks.  If three good markers mark an A-level history essay the range of possible marks might be 16, 17 or 18 out of 25.  So there is a three mark tolerance for this type of essay answer. Remarkers do not necessarily use tolerance when deciding whether in their judgement a mark could reasonably have been given, but the concept is nevertheless relevant to their work.
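
As a rough sketch of how such a rule might work in code (Python, with invented marks and the simplifying assumption that the ‘reasonable range’ is a fixed band around the reviewing examiner’s mark – real board procedures are more nuanced than this):

```python
def reviewed_mark(original: int, reviewer: int, tolerance: int) -> int:
    """Return the mark that stands after a post-2016 review.

    The original mark is only replaced if it falls outside the band of marks
    an experienced examiner could reasonably have given (modelled here, for
    simplicity, as the reviewing examiner's mark plus or minus the tolerance).
    """
    if reviewer - tolerance <= original <= reviewer + tolerance:
        return original      # a reasonable mark: it stands
    return reviewer          # an unreasonable mark: corrected on review

# A long-answer question marked out of 25, with a two-mark tolerance:
print(reviewed_mark(original=16, reviewer=18, tolerance=2))   # 16 stands
print(reviewed_mark(original=11, reviewer=18, tolerance=2))   # changed to 18
```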

 

Differences between different exam boards are a source of unfairness

Ofqual does a good job trying to ensure that all exam boards are working to the same standard for any given subject.  However, it is an imprecise task when one is comparing completely different exam questions.  We have been plagued over the years by exam boards competing for market share by ‘dumbing down’ – easier syllabuses, easier questions, more generous marking, more generous grading.  The situation has improved since 2011 as Ofqual has taken a harder line.  The situation will improve further from 2015 as the reformed GCSE, AS and A level specifications bring exam boards into line and Ofqual makes a big effort to ensure that syllabuses and sample papers are all of a similar standard.

None of the above issues may be as profound as the issue of inter-subject comparability

Some subjects are easier than others.  The Centre for Evaluation and Monitoring at the University of Durham has been publishing materials about this for many years.  For example, if you look at the average GCSE grade of pupils getting an A grade in an A-level, you find it is lower for some subjects than others.  The ‘hard’ subjects include Latin, Maths, Further Maths, Physics, Chemistry, modern foreign languages.  So pupils taking these subjects get lower grades at A-level than they would have done had they taken the ‘easier’ subjects.  At present there is no requirement for exam boards to align subjects by degree of difficulty (partly because degree of difficulty is so hard to define).
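
A back-of-the-envelope version of that comparison might look like the following (Python, with invented data – this is not CEM’s actual methodology): for each A-level subject, take the average prior GCSE score of the candidates who achieved a grade A; the lower that average, the ‘easier’ the subject on this measure.

```python
from collections import defaultdict

# (A-level subject, A-level grade, candidate's mean GCSE points) - invented data
results = [
    ("Physics", "A", 7.8), ("Physics", "A", 7.5), ("Physics", "B", 7.0),
    ("Subject X", "A", 6.3), ("Subject X", "A", 6.7), ("Subject X", "B", 5.9),
]

prior = defaultdict(list)
for subject, grade, gcse_mean in results:
    if grade == "A":                     # only candidates who achieved an A
        prior[subject].append(gcse_mean)

for subject, scores in prior.items():
    print(subject, round(sum(scores) / len(scores), 2))
# The subject whose A-grade candidates arrive with the weaker GCSE profile
# looks 'easier' on this measure.
```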

The problems with having ‘harder’ and ‘easier’ subjects at both GCSE and A-level are:

  • not all users (universities etc) will have the knowledge to discriminate between them.
  • performance tables don’t discriminate so if some subjects are ‘easier’ they will become more popular for no very good reason.  And then schools which do lots of easy subjects appear better than they are, misleading parents and Ofsted.
  • the UCAS tariff doesn’t discriminate – all subjects are given the same points score.  This is a problem for degree subjects like law where admissions tutors might accept any A-level subjects.
  • we don’t want to choke off ‘difficult’ subjects such as modern languages and sciences which the nation needs.  We know there has been particular concern about modern languages at A-level in recent years, partly perhaps because of the number of native speakers taking these A-levels.  (This is a particular problem with Mandarin where most of the students sitting the A-level in the UK are Chinese.)  Science bodies have shown that pupils with any given set of GCSE grades tend to get worse science A-level results than they would in other subjects.
  • at school level, teachers of easy subjects might get more pay and promotion than teachers of harder subjects simply because their results look better.
  • pupils kid themselves that they are ‘best’ at the subjects in which they get the highest grades and make university subject and career decisions on this basis.

There are a number of things which could be done.  You could simply publish the inter-subject comparability data to the users (universities etc) and leave them to adjust their offers accordingly.  Some countries, such as Hong Kong and Australia, adjust the grades of optional subjects using a formula which relates the scores achieved in compulsory subjects to those achieved in options.  You could continue as now but publish a SECOND grade which is adjusted.  You could continue as now but publish a rank as well as the grade.
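
A deliberately simplified sketch of that kind of scaling (Python; this is not the actual Hong Kong or Australian algorithm): shift each optional subject’s marks so that its candidates’ average matches the average those same candidates achieved in the compulsory subjects.

```python
def scale_option(option_marks, compulsory_means):
    """option_marks[i] and compulsory_means[i] belong to the same candidate:
    the mark in the optional subject and the mean mark across the compulsory
    subjects."""
    shift = (sum(compulsory_means) / len(compulsory_means)
             - sum(option_marks) / len(option_marks))
    return [mark + shift for mark in option_marks]

# Candidates taking a 'hard' option score lower in it than in their compulsory
# subjects, so the option's marks are shifted upwards:
print(scale_option([55, 60, 65], [66, 70, 74]))   # [65.0, 70.0, 75.0]
```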

But the fact will remain it is very difficult to compare the ability required to do well in subjects like Art and Physics.  They require very different skill sets.

 

Grading by using comparable outcomes

 

One important objective of Ofqual is to maintain grade standards over time, ensuring that the scripts being graded this year would have got the same grade had they been set, marked and graded last year.  This is another way of saying that they want to stop grade inflation because it undermines confidence in the system.

It is not simply a matter of saying ‘over 80% is an A grade’, because this year’s papers may be slightly harder than last year, in which case the 80% rule would be unfair on this year’s candidates. So examiners take a close look at the scripts from last year which were just below and just above each grade boundary and compare them with this year’s scripts.

Nor is it simply a matter of saying ‘x% of candidates got an A grade last year so we should give the same % an A this year’. After all, the candidates’ work this year might be better – over time, teachers become more skilled at teaching a given syllabus, or schools might start diverting weaker students to other subjects where they have a greater chance of success. Or it might be that candidates get worse because schools start putting much younger children in for an exam (as some schools did with GCSE in recent years) or because the best students have moved away to other subjects or qualifications (such as the Pre-U, an alternative to A-levels).  We know that some bright children found the unreformed GCSEs rather undemanding, which is why independent schools put them in for the iGCSE instead – so some of the best students were removed from the group taking GCSEs.

Where a new syllabus has been introduced, it is quite possible that it will be slightly easier or slightly harder than the previous syllabus. But because the new syllabus is different from the previous version it is not as easy simply to compare this year’s scripts with last year’s. Furthermore, experience tells us that results often dip when a new syllabus comes in because teachers are less used to it.  This is called the sawtooth effect and Ofqual research shows that the effect normally lasts three years.  This is potentially unfair on candidates, so when a new syllabus starts Ofqual asks exam boards to operate on the ASSUMPTION that the grade distribution this year will be very similar to last year if the cohort is similar.  This is comparable outcomes.

The way they have checked whether this year’s GCSE cohort is similar to last year’s is by comparing their Key Stage 2 test scores in maths, English and science. If this year’s cohort is similar to last year’s Ofqual expects there to be a similar grade distribution in any given subject.  However, it is not a firm requirement – if an exam board can prove to Ofqual that standards really have risen this year, then it can give higher grades. And of course it only applies to the grade distribution for the cohort as a whole, not to individual candidates! A child who did well in KS2 but was lazy for the next five years would still do badly at GCSE.
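
The underlying arithmetic can be sketched quite simply (Python, with invented figures – Ofqual’s real prediction matrices are far more fine-grained): last year’s grade distribution within each KS2 prior-attainment band is re-weighted by this year’s cohort mix to give the statistical prediction the exam board starts from.

```python
last_year = {   # P(grade band | KS2 prior-attainment band), from last year's results
    "low":    {"7+": 0.05, "4-6": 0.45, "1-3": 0.50},
    "middle": {"7+": 0.20, "4-6": 0.65, "1-3": 0.15},
    "high":   {"7+": 0.60, "4-6": 0.38, "1-3": 0.02},
}
this_year_mix = {"low": 0.30, "middle": 0.45, "high": 0.25}   # share of this year's cohort

expected = {"7+": 0.0, "4-6": 0.0, "1-3": 0.0}
for band, share in this_year_mix.items():
    for grade_band, p in last_year[band].items():
        expected[grade_band] += share * p

print(expected)   # the prediction the exam board starts from; judgement can still move it
```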

Of course the use of KS2 data could be criticised as the basis for this comparison – it is five years old and most pupils in independent schools do not take KS2 tests so the data for them is missing.  From 2018 onwards Ofqual will do comparable outcomes by making use of National Reference Tests – tests taken in English and Maths by 18,000 children shortly before they sit the GCSE itself. Marks in the National Reference Tests will be aligned to the new GCSE grade boundaries in maths and English Language. The same test will be used every year so it will be possible to see whether standards in Maths and English have really risen or not.

So for example, if in 2018 55% of students get grade 4 or better and in the NRT 55% got 38 marks or better, this means 38 can be set as the minimum mark for grade 4.  So if in the next year’s NRT 57% got 38, the proportion getting a grade 4 in the GCSE could rise to 57%.  In reality, because of the sawtooth effect, the NRT may not be used much before 2020.  Furthermore, the results of the NRT will not be used without other evidence to support a rise or fall in the number of higher grades.  Examiner judgement and the results of the Key Stage 2 tests for any given cohort will also play a part.
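
Here is the anchoring arithmetic from that example as a small sketch (Python, with invented marks; in practice the NRT evidence is weighed alongside examiner judgement and KS2 data before anything changes):

```python
def proportion_at_or_above(marks, anchor):
    return sum(m >= anchor for m in marks) / len(marks)

# Year 1: 55% reach grade 4 in the GCSE and 55% reach 38 marks in the NRT,
# so 38 becomes the anchor mark for grade 4.
anchor_mark = 38

# Year 2: invented NRT marks suggest more students are reaching the anchor...
nrt_year_2 = [45, 39, 38, 36, 41, 30, 38, 44, 29, 40]
share = proportion_at_or_above(nrt_year_2, anchor_mark)
print(f"{share:.0%} reached the anchor, so the grade 4+ share could rise to that level")
```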

Does the NRT have any bearing on subjects other than maths and English?  Probably not, although further statistical work will need to be done on that.

Comparable outcomes only works for subjects where there are large groups of students and the make-up of the cohort is similar from one year to the next.  It cannot be used for very small subjects or where the cohort has changed, as happens quite often.

600,000 children sit GCSEs every year. In 2015, 5.6 million GCSE subject entries were sat – around 11 million exam scripts. Human judgement is central to the grade achieved by every child, but with such huge numbers a statistical framework which acts to ensure fairness from one year to the next is needed.

 

The Good and Cresswell effect

When judging performance, we are more likely to favour performance against an easier challenge than against a more testing one. This was the conclusion that Good and Cresswell came to when they studied how examiners rated the work of students facing different levels of difficulty in question papers.

Their conclusion was the following:

“The awarders tended to consider fewer candidates to be worthy of any given grade on harder papers or, alternatively, that more candidates reached the required standards on easier papers.” (Good & Cresswell, 1988)

If judgement alone were used in monitoring the standards of examinations, then whenever a hard question paper was set performance would appear to have declined, whereas whenever an easier question paper was set performance would appear to have improved.

The allocation of grades is based on a combination of Ofqual’s statistical recommendations based on comparable outcomes AND examiner judgement.

Raising the bar

One method used by the DfE since 2010 to raise standards in England is to raise the bar in terms of performance measures (how schools are judged) and floor standards (the minimum acceptable level below which action will be taken).  Floor standards have been raised for both primary schools (Key Stage 2 tests) and GCSEs.  In 2016 primary schools needed 65% of pupils to have met the national standard in all of reading, writing and maths. Floor standards at GCSE are based on the Progress 8 measure.  Floor standards post-16 are based on measures of attainment, progress, retention and destinations.

Up until 2017 the DfE had regarded grade C/4 as the pass grade for GCSE.  From that year the new grade 5 (a high C, low B) will be used in performance tables for the EBacc measure and is designated a ‘good pass’.

Floor standards have a political purpose: during a period where government wishes to make all schools become Academies, raising floor standards drives more and more into that net.

Raising floor standards at GCSE on the assumption that this may encourage more schools to improve is in direct contradiction to the principles of comparable outcomes grading.  Comparable outcomes means that the same proportion of pupils nationally will always fall below the pass grade each year.

What are grade descriptors?

GCSEs and A-levels have grade descriptors for each subject. These give “a general indication of the standards of achievement likely to have been shown by pupils awarded a particular grade.” Here is part of the grade descriptors for Art GCSE:

To achieve Grade 8 candidates will be able to:

  • demonstrate independent critical investigation and in-depth understanding of sources to develop ideas convincingly
  • effectively apply a wide range of creative and technical skills, experimentation and innovation to develop and refine work
  • record and use perceptive insights and observations with well-considered influences on ideas
  • demonstrate advanced use of visual language, technique, media and contexts to realise personal ideas

 

To achieve Grade 5 candidates will be able to:

  • demonstrate competent critical investigation and understanding of sources to develop ideas coherently
  • apply a range of creative and technical skills and some experimentation and innovation to develop and refine work
  • record and use clear observations to influence ideas
  • demonstrate competent use of visual language, technique, media and contexts to realise personal ideas

 

Until recently these grade descriptors were used for grading exams.   If you can imagine applying these to art portfolios, you will see the difficulty with grade descriptors!  Everything hinges on the meaning of words such as ‘effectively’ and ‘in-depth’.   So now they are merely used to guide teachers as to the approximate meaning of each grade.

So how are exams graded?

Some grade boundaries are defined by EXAMINERS’ JUDGEMENT. To identify the boundary mark for each of these JUDGEMENT GRADES the procedure followed by the exam boards is as follows:

First, taking a sample of A-level scripts whose marks are believed to be close to, say, the A/B grade boundary, a group of experienced examiners look at the scripts starting with those with the highest mark and working down. They try to agree on the lowest mark for which there is consensus that the quality of work is worthy of an A rather than a B. This is called the upper limiting mark.

Next, taking scripts with marks a little below the upper limiting mark and working up from the bottom, they identify the highest mark for which there is consensus that the quality of work is not worthy of the higher grade. The mark above this forms the lower limiting mark.

The chair of examiners must then weigh all the available evidence – quantitative and qualitative – and recommend a single mark for the A/B grade boundary, normally within the range of the two limiting marks.
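
As a toy model of the limiting-mark procedure (Python – the real exercise is a discussion among examiners over actual scripts, not a lookup table), suppose we record whether there was consensus that scripts on each mark deserved the higher grade:

```python
# True = consensus the scripts on this mark are worthy of an A,
# False = consensus they are not, None = no consensus either way.
judgement = {58: True, 57: True, 56: True, 55: None,
             54: None, 53: False, 52: False, 51: False}

marks = sorted(judgement, reverse=True)

# Working down from the top: the lowest mark still agreed worthy of an A.
upper_limiting = min(m for m in marks if judgement[m] is True)

# Working up from the bottom: the highest mark agreed NOT worthy of an A;
# the mark above it is the lower limiting mark.
lower_limiting = max(m for m in marks if judgement[m] is False) + 1

print(upper_limiting, lower_limiting)   # 56 and 54
# The chair of examiners then recommends a single A/B boundary, normally in 54-56.
```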

The chair of examiners makes the recommendation to the officer of the exam board with overall responsibility for the standard of qualifications. That officer may accept or vary the recommendation and will subsequently make a final recommendation to Ofqual. Ofqual may approve it or give reasons why they are not happy with it.  In the latter case the exam board must reconsider and produce a final report.  Ultimately Ofqual can direct an exam board to prevent it setting what it believes to be unjustifiably high or low grade boundaries.

BUT NOT ALL GRADES ARE JUDGEMENT GRADES. SOME GRADE BOUNDARIES ARE DEFINED ARITHMETICALLY. AT A-LEVEL THE FOLLOWING HAPPENS:

A-levels are graded by establishing the ‘correct’ mark for the bottom of grade A and the bottom of grade E as described above; the range between those two marks is then divided equally into four, creating the B/C, C/D and D/E boundaries. So the bottom of grades A and E are determined judgementally using two elements:

 

  1. ‘Comparable outcomes’: the proportion of the cohort who achieve an A and E should be similar to the previous year.
  2. Examiner judgement: the quality of the scripts at these grade boundaries. They compare scripts from last year that were on a grade boundary with this year’s scripts.

 

The B/C, C/D and D/E boundaries are determined arithmetically.
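
The arithmetic step is easy to illustrate (Python, with invented boundary marks):

```python
def intermediate_boundaries(bottom_of_a: int, bottom_of_e: int) -> dict:
    """Divide the range between the judgementally set A and E boundaries
    into four equal steps to give the B/C, C/D and D/E boundaries."""
    step = (bottom_of_a - bottom_of_e) / 4
    return {
        "B/C": round(bottom_of_a - step),
        "C/D": round(bottom_of_a - 2 * step),
        "D/E": round(bottom_of_a - 3 * step),
    }

# On a 100-mark paper where the A boundary is set at 72 and the E boundary at 32:
print(intermediate_boundaries(72, 32))   # {'B/C': 62, 'C/D': 52, 'D/E': 42}
```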

In the first years of the reformed A-levels (first exam 2017) the A* will be set using the comparable outcomes method: the proportion getting an A* will be roughly the same for each subject every year but can be slightly increased or decreased if the GCSE results of the previous year’s cohort suggest it was stronger or weaker than this year’s cohort.

Of course if the cohort taking an AS or A-level is very different to that taking it in the previous year, the comparable outcomes method is harder to operate. If a group’s GCSE results tell us that they are different from the previous cohort, then examiner judgement (the quality of the scripts) plays a greater role. Similarly, if the cohort size is small statistical predictions are less reliable and examiner judgement plays a bigger role.

 

For the first award of GCSE grades 1 to 9, and where the size and nature of the candidature in a subject allows, grading will be based primarily on statistical predictions. Examiner judgement will remain part of the process but, because of the reforms to each subject and the move to a new grading scale, it will be less reliable than in normal years.

So broadly the same proportion of students achieve a grade 4 and above as previously achieved a grade C and above in the subject.  Broadly the same proportion of students achieve a grade 7 and above as previously achieved a grade A and above in the subject.  Broadly the same proportion of students achieve a grade 1 and above as previously achieved a grade G and above.

Grades 2, 3, 5 and 6 are awarded arithmetically so that the grade boundaries are equally spaced in terms of marks from neighbouring grades. Grade 5 is positioned in the top third of the marks for an old grade C and the bottom third of the marks for an old grade B.

Grade 8 is awarded arithmetically such that its grade boundary is equally spaced in terms of marks from the grade 7 and 9 boundaries.

Across all subjects (as opposed to within each individual subject) close to 20 per cent of those awarded a grade 7 or above will be awarded a grade 9.  However, the proportion of grade 9s awarded in each subject will vary depending on the overall proportion of grades 7 and above awarded within the subject; a formula will be used to achieve this:

Percentage of those achieving at least grade 7 who should be awarded grade 9 = 7% + 0.5 x percentage of candidates awarded grade 7 or above.
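
That formula as a one-line function (Python; percentages in and out):

```python
def grade9_share_of_grade7_plus(pct_grade7_or_above: float) -> float:
    """Percentage of those achieving at least grade 7 who are awarded grade 9."""
    return 7 + 0.5 * pct_grade7_or_above

# In a subject where 20% of candidates achieve grade 7 or above,
# 7 + 0.5 * 20 = 17% of that group are awarded a grade 9:
print(grade9_share_of_grade7_plus(20))   # 17.0
```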

 

So the award of grades is complex and well thought-through, but has a degree of arbitrariness about it:  some grades are determined arithmetically and all are affected by the grade distributions used in previous years.

If grades are a bit arbitrary why do we use them? Because it is easier to make them comparable over time than raw scores. Raw scores will vary from year to year depending in part on the level of difficulty of the papers set. Grades on the other hand are assigned on the basis that a grade B this year should mean something similar to a grade B last year. The main problem with grades is really the cliff-edge effect: the difference between one grade and another is just one mark. But it is also the difference between going to university to read medicine or not – so it affects students’ lives quite dramatically.

How might the cliff-edge effect be overcome? The answer could be to publish more information. Instead of universities being just given students’ grades they should be given details of how close they were to a higher grade boundary or the students’ rank in the list of all those taking an exam. Given that all such data is now held digitally, neither of these things would be difficult. Incidentally, the merit of giving ranks is one of the arguments for having just one exam board offering each subject. Ranks are not impossible if you have multiple exam boards, but much more difficult.
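
A sketch of the extra information that could be released alongside each grade (Python, with invented marks, boundaries and candidate identifiers):

```python
boundaries = {"A": 72, "B": 62, "C": 52, "D": 42, "E": 32}   # invented boundary marks
cohort_marks = {"cand_001": 61, "cand_002": 75, "cand_003": 58}

# Rank order across the cohort, highest mark first.
ranked = sorted(cohort_marks, key=cohort_marks.get, reverse=True)

for cand, mark in cohort_marks.items():
    next_boundary = min((b for b in boundaries.values() if b > mark), default=None)
    marks_short = None if next_boundary is None else next_boundary - mark
    print(cand, "rank:", ranked.index(cand) + 1, "marks short of next grade:", marks_short)
# cand_001 was a single mark short of a B - information a bare grade hides.
```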

 

The two basic types of grades

 

Norm-referenced grading simply means that every year you give a fixed proportion of candidates any given grade – 10% get an A, 25% get a B and so on.  This method has two great advantages – it tells universities and other people who are selecting the best students where each student is in the rank order, and if the percentages getting each grade are fixed from one year to the next there can be no grade inflation.
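
A minimal sketch of norm-referenced grading (Python, with invented marks and proportions): rank the cohort and hand out grades in fixed shares from the top down.

```python
def norm_reference(marks: dict, shares: list) -> dict:
    """marks: candidate -> mark.  shares: [(grade, proportion), ...] listed
    from the top grade down; the proportions should sum to 1."""
    ranked = sorted(marks, key=marks.get, reverse=True)
    grades, i = {}, 0
    for grade, proportion in shares:
        n = round(proportion * len(ranked))
        for cand in ranked[i:i + n]:
            grades[cand] = grade
        i += n
    for cand in ranked[i:]:              # any rounding leftovers take the bottom grade
        grades[cand] = shares[-1][0]
    return grades

marks = {"p1": 91, "p2": 85, "p3": 77, "p4": 70, "p5": 52,
         "p6": 48, "p7": 45, "p8": 40, "p9": 33, "p10": 21}
print(norm_reference(marks, [("A", 0.10), ("B", 0.25), ("C", 0.40), ("D", 0.25)]))
# Whatever the absolute quality of the work, only 10% of candidates get an A.
```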

 

We use norm-referencing all the time in our day-to-day lives.  For example, my car may get 40 miles to the gallon, but I only know how good that is by comparing it with other cars – some of which are more fuel-efficient, some less.  We compare our results with others.

Norm-referencing has detractors because it appears to create winners and losers.  You may be a pretty good classicist but if 55% of the other candidates are even better than you your grade may be depressed.  You are ‘below average’ and however good you are that makes you feel like a loser.
Another problem with having a fixed % of students getting each grade is that it does not tell you what each student actually knows.  After all, half the students taking a subject might be very good at it (as would be the case with GCSE Latin or A-level Further Maths for example) but under the fixed % system some of these very good students might only get a C grade just because a proportion of the candidates were even better.

So another problem with the fixed % system is that it has the potential to be demotivating.  Individual students and indeed whole schools might well feel that there is little point in striving to do better because under this system it is so much harder to achieve improved grades – you can only improve if other people or other schools do relatively worse.

 

The alternative is called criteria-referenced grading where the exam board starts by defining what candidates must know in order to achieve any given grade.  Normally there will be an attempt to define this in words and there may also be a minimum mark needed for each grade.  This system has the advantage that if all candidates perform very well, all can achieve a high grade.  In the case of subjects which are normally taken only by fairly able pupils, such as Further Maths, Latin or Greek, this seems only fair.

A form of criteria-referenced grading is what we have used for public exams in the UK in recent years, but the system has suffered from grade inflation as candidates have been better and better prepared.  If too many students gain a high grade, universities can no longer select the best students using public exam grades – they have been forced to create their own special tests.

Furthermore, it is not easy to define on paper what any given grade should look like – the grade descriptors are highly subjective.

 

Our current system uses both norm-referencing and criteria-referencing.  It is based on criteria-referencing to the extent that examiners have to judge the quality of scripts, but the comparable outcomes approach to grading has introduced a greater element of norm-referencing.

 

 

Reliability of an assessment

 

Reliability of an assessment is a technical term used by examiners.  It means the extent to which the assessment consistently and accurately measures learning.  When the results of an assessment are reliable, we can be confident that if we repeated a test with the same or similar pupils tomorrow it would provide similar results.  If you weighed yourself on a set of scales you would hope that if you repeated the weighing five minutes later the result would be the same – that is reliability.

Factors which can affect reliability include:

  • The length of the assessment – a longer assessment generally produces more reliable results.
  • The suitability of the questions or tasks for the students being assessed.
  • The phrasing and terminology of the questions.
  • The state of mind of the students – for example, a hot afternoon might not be the best time for students to be assessed.
  • The design of the mark scheme and the quality of the markers.
  • The consistency of the methods used to turn marks into grades.
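
One simple way to put a number on reliability – one of several used by assessment researchers – is the correlation between two sittings of the same or a parallel test by the same pupils. A sketch with invented scores:

```python
from statistics import correlation   # Python 3.10+

sitting_1 = [55, 62, 71, 48, 80, 66, 59, 73]   # the same pupils...
sitting_2 = [53, 65, 70, 50, 78, 64, 61, 75]   # ...retested the next day

print(round(correlation(sitting_1, sitting_2), 2))   # close to 1.0 = highly reliable
```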

Measurement error

In fact if a pupil takes the same or very similar exam several times they will not get exactly the same score every time.  In 2013 quite significant numbers took both the GCSE and iGCSE in English;  one third of them got the same grade in both, one third got a higher grade in the GCSE, one third did better in the iGCSE.  Because both qualifications are supposed to work to the same grading standard, all candidates should have obtained the same grade in both.  The fact that they didn’t was partly due to measurement error, a rarely-discussed characteristic of all exams.

The measurement error might have been caused by the fact that the two exam papers contained different questions and students may have been better prepared for one than the other.  Or because all students have good and bad days. Or because of slight differences in the harshness or leniency of markers.

In public exams we transform marks into grades.  The line between one grade and another (the cut score) is always slightly arbitrary.  Many candidates have marks close to a grade boundary.  The margin of error certainly means that some candidates who one day got a grade B might on the next day get an A.  The use of marking tolerances in long-answer papers like History A-level means that two IDENTICAL scripts can legitimately be given different marks and thus different grades.

The message for pupils is: try to avoid being near a lower grade boundary.

The message for universities and others selecting young people is: try to obtain evidence beyond one set of exam results.  Oxford and Cambridge universities are of course the most selective universities in the UK.  They typically use a range of measures to select students: GCSE results, AS-level results where they have them, A-level results, pre-test results such as the BMAT test for applicants for medicine, the school reference, the student’s own personal statement, a sample of written work submitted by the student, and two interviews.  This is sensible, if expensive.

The message for Governments is: if you are going to use test and exam results for school accountability purposes, employ an expert in statistics to calculate the margin of error so that decisions are not made on the basis of inconclusive data.

 

Sampling error

Whereas measurement error applies to the mark achieved by individual pupils, sampling error applies to errors that arise from the selection of a sample of particular students or schools for the purposes of measurement.  For example, the PISA tests sampled 20,000 15-year olds in the UK in 2015.

In the 2015 UK General Election and the 2016 UK referendum on membership of the EU most polls, even those taken on the day of the vote, were wrong.  In 2015 few predicted a Conservative victory.  In 2016 most predicted a victory for Remain.  The polls were based on a sample of the voting population, carefully chosen of course but only a sample.  The poll results had a known margin of error (say plus or minus 2%) which in both cases was fundamental to the outcome as both the General Election and the Referendum results were close.  But the British public were not given this data.  Both polls were, incidentally, affected by what is called social desirability bias – a tendency for people to give an answer which they felt was more socially acceptable.  Many Conservative voters (‘shy Tories’) felt it was more socially acceptable not to admit to voting Tory, as did many Leave voters.

In education sampling error applies where people quote aggregate test scores from a sample of students, such as the PISA tests measuring a sample of 15-year olds.  Sampling error is more likely if the sample is small or, of course, if the sample is unrepresentative of the whole.  With PISA this happens when they try to draw conclusions about sub-sets of their sample, such as the performance of different ethnic groups or of private schools: here the sample size is small and much more prone to error.
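
The arithmetic behind sample size is worth seeing (Python): the textbook margin of error for a proportion shrinks only with the square root of the sample size, so sub-groups carved out of a 20,000-pupil sample can have much wider error bars. (PISA’s clustered sampling makes the true margins wider still; this is the simple-random-sample formula.)

```python
from math import sqrt

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed proportion p from a
    simple random sample of size n (assumes the sample is representative)."""
    return z * sqrt(p * (1 - p) / n)

print(round(margin_of_error(0.5, 20000), 3))   # ~0.007 for the full sample
print(round(margin_of_error(0.5, 400), 3))     # ~0.049 for a small sub-group
```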

Sampling error can also apply to conclusions made about individual schools.  It is dangerous to draw conclusions about the effectiveness of a school by looking at the exam results gained by just one year group in one particular June because that sample year group may not be typical of all year groups in the school.  This is especially a problem for small schools where the very bad performance of just two or three children can distort results.

 

Validity of an exam

Validity is a technical term used in the world of exams which has a specific meaning and should not be confused with the normal use of the word.  Validity in exams is not about the exams themselves, it is about the validity of the inferences you can make from the exam results.  The validity of an assessment is the extent to which it measures what it was designed to measure. 

 

For example, a test of mathematics should measure mathematical ability and not the ability to read English so a maths exam with lots of English text will reduce its validity as a test of maths.

 

In recent years in England university Physics departments complained that students coming to them with good grades in A-level Physics were not ready to take an undergraduate degree in Physics because their maths was too weak.  A Physics A-level should measure your ability to do Physics at a good level.  But because of dumbing down much maths had been stripped out so the qualification lacked some validity.

 

One of the difficulties examiners have is deciding how long a set of exams should be.  An exam only samples a pupil’s knowledge, and the smaller the sample the less likely it is to test all the various domains of subject content.  Examiners rather pompously call this construct under-representation.  It makes the exam less valid.
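
A trivial illustration of that trade-off (Python, under the simplifying assumption that each question covers one distinct topic area):

```python
def untested_share(num_topics: int, num_questions: int) -> float:
    """Share of the syllabus's topic areas left untested by a paper in which
    each question covers one distinct topic (a simplifying assumption)."""
    return max(0.0, 1 - num_questions / num_topics)

# A syllabus with 30 topic areas examined by an 8-question paper:
print(round(untested_share(30, 8), 2))   # 0.73 - most of the domain goes unexamined
```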

 

An example of construct under-representation was the French oral pupils took for GCSE French in recent years.  The oral was supposed to test the ability to speak in French but the way it was designed allowed pupils to rote learn a selection of phrases and spout them out to the examiner.  The test did not really measure their ability to speak French.

 

There is a link between reliability and validity.  For example, if the purpose of a design technology course is to measure the ability of students to make products out of wood, metal or plastic with technical skill and imagination you are going to have to rely on a good deal of coursework to make the assessment valid.  But coursework of this sort, spread over many weeks with the teacher giving greater or lesser degrees of help, is a less reliable form of assessment.

On the other hand, insisting on highly consistent assessment conditions to attain high reliability will result in little flexibility, and might therefore limit validity.  If all DT students have to make the same product, for example, the assessment will not measure imagination.
