#DevelopingExpertTeaching Series: Assessing student performance and learning.

This post is the second in a series of six posts (the first post will be available shortly!) following my progress through the Masters in Expert Teaching programme I am undertaking with Ambition Institute. In this article I look at assessment, specifically assessment in science in UKS2. The structure of this article follows the standard structure of all Master’s assignments:

  1. Explore
  2. Translate
  3. Implement
  4. Investigate
  5. Disseminate


Assessing student performance and learning in upper KS2 science through the implementation of hinge questions.


The difference between learning and performance is well researched and evidenced (Bjork & Bjork, 2011; Soderstrom and Bjork, 2015; Sumeracki and Weinstein, 2018). For teachers on the front line in the classroom, it can be easy to fall into the trap of mistaking performance for learning and moving through curriculum content at too fast a pace, only to be shocked when children fail to recall the content in subsequent lessons or during assessments. Wiliam’s suggestion that teachers should build a Plan B into a Plan A struck me not only as common sense, but as a way of lesson planning that puts misconceptions at the heart of that planning. Through using hinge questions to support teaching and learning in upper key stage 2 science, I found that I was able to focus my immediate intervention on specific children based on the misconception I identified they had at that exact moment, meaning it could be addressed at the point of learning, not a week later. This resulted in children remembering more science content correctly and understanding the underlying scientific concepts in greater depth. Investigating whether this could be applied to other subject areas in other year groups saw similar results.

Exploring assessment research:

Assessment of children in British schools is not a new thing; Dylan Wiliam notes in his foreword to Daisy Christodoulou’s book Making Good Progress (2017) that in 1987 the UK Government intended to introduce a new National Curriculum. Over the summer, working parties were set up and the then Education Secretary, Kenneth Baker, asked Professor Paul Black to chair the National Curriculum Task Group on Assessment and Testing (p.5). That was 34 years ago. And still to this day, conversations, debates and studies are ongoing to try to answer the question: how do you best assess student performance and learning? No organisation can claim to have developed a holistic assessment system that attends to both formative and summative assessment comprehensively and without fault. Furthermore, the last two years have shown those working in the education sector that the current high-stakes examination-based system is indeed dispensable or, at the very least, should be reviewed.

To examine such a broad area as assessment theory, I will focus on three of the overarching concepts that are fundamental to affecting any change on children’s learning:

  1. The purpose of assessment
  2. Reliability and validity
  3. The difference between learning and performance

The purpose of assessment

Coming from the Latin ‘assidere’, which literally means ‘to sit beside’, one can assume that assessment means to form a judgement about capabilities. However, as with so many words, the meaning has changed over the years and come to mean many different things, and even within education we assess for an extraordinarily wide range of reasons. As a result of assessments being such an integral part of the everyday life of a school, we often forget that they are conducted for particular reasons. Specifically, assessments are conducted to draw conclusions. As Cronbach (1971) pointed out many years ago, an assessment is just a procedure for making inferences.

In schools, two main types of assessment seem prevalent: summative and formative assessments. All too often, teachers assess because their school policy requires them to. Some, particularly teachers earlier on in their career, do not have a robust enough understanding of why they are assessing and what they are hoping to achieve by assessing. There is no shared vision of what the school hopes to achieve through assessment and what the outcomes will be used for (other than accountability). The purpose of assessing summatively is to create a shared meaning and the purpose of assessing formatively is to produce a consequence for the teacher and pupil.

However, the blurring of formative and summative assessment and their primary purposes seems to have caused teachers great confusion, most notably where end of phase SATs tests have been used at the end of each term to provide an “idea of whether children will get to age related expectations”.

This is further compounded by the stakes that are placed on assessments in schools. “Part of the reason schools grade pupils frequently is because they are required to do so for accountability purposes” (Christodoulou, 2017, p.75) and the more frequently these grades are issued, the less likely a child is to have made perceived progress in their test score, which teachers must justify in their termly Pupil Progress Meeting with Senior Leaders. Because of this, the way in which children sit the assessment, and the preparation for it, varies radically between teachers, meaning the reliability of the inferences made from these assessments is called into question. I agree with Christodoulou, who notes that “it is very hard to make genuine and significant improvements on such big domains over just six weeks” (2017, p.129) and that “measuring progress with grades therefore encourages teaching to the test, which compromises learning” (2017, p.129).

Possible reforms to England’s summative assessment regime are out of the range of this work, therefore the focus of this work will be on securing sound formative judgements within the classroom.

Reliability and validity

We can only ever make inferences about what children have learned. Nick Rose (2021) described an inference as “using the evidence available to you to make a useful judgement.” No one can say for certain what a child has or has not learned. Therefore, within any classroom, it is vital that a teacher’s assessment measures what it claims or purports to measure (Wiliam, 2014, p.22).

The use of terminology such as “inferences” when talking about assessments is important: it shifts thinking away from the idea that a single assessment, whether formative or summative, has pinpoint accuracy, and towards treating each one as a single piece of the ever-shifting picture of a child within a class.

Validation is the central concept in assessment and, as Wiliam (2014) states, “it is the process of establishing which kind of conclusions are warranted and which are not.” Validity refers to the extent to which the inferences about learning are accurate and free from bias. In effect, are we measuring what we intended to measure?

One important aspect of validity is reliability. Reliability refers to the consistency of a judgement about learning. In educational measurement, a test is reliable if the test scores are consistent. Christodoulou (2017) identifies threats to reliability, which include:

  • Sampling: most tests do not directly measure a domain, they only sample from it and students can do better or worse depending on the particular sample.
  • Marker: different markers may disagree on quality and applying standards consistently is difficult, even for one marker.
  • Student: performance can vary from day to day and students perform differently depending on factors such as illness, time of day and whether they have eaten beforehand or not.

From the factors listed above, we can see why it is impossible for any test to be completely reliable.

Reliability in the assessment of student learning is also about accuracy and consistency; the higher the stakes of the decision we want to make based on assessment information, the more accurate and consistent we want the information to be. High-stakes decisions need highly reliable information.

To contextualise the importance of these concepts further within a primary school setting, the science Teacher Assessment data and the science Sampling Data collected from 2018 highlight a significant difference. Science is considered to be a core subject within the primary setting; however, it is not afforded the same status as English and maths. Furthermore, children do not sit an end of Key Stage Test in science. Instead, teachers are expected to report a teacher assessment for science attainment at the end of Year 6. On average, teacher assessment from that year indicated that 82% of children achieved the Expected Standard in science, whereas the Sample Tests revealed that only 21% of children achieved the Expected Standard in the same subject in the same year. This makes it difficult to make valid inferences about what effective science learning looks like, as there is no clear consensus in this area. As a result of this, formative assessment and responsive teaching in Year 5 science will form the basis for my exploration.

The difference between learning and performance

One of the predominant issues within the classroom setting is distinguishing performance from learning. The primary goal of instruction is to facilitate long-term learning – that is, to create relatively permanent changes in comprehension, understanding, and skills of the types that will support long-term retention and transfer. The “distinction between learning and performance is crucial” (Soderstrom & Bjork, 2015, p.176) and so too is the distinction between the different methods of assessment used in a classroom setting and their intended outcome. Durability of learning is key. As teachers we want knowledge and skills to be durable in the sense of remaining accessible across periods of disuse and to be flexible in the sense of being accessible in the various contexts in which they are relevant, not simply in contexts that match those experienced during the initial lesson or instruction. In other words, instruction should endeavor to facilitate learning, which refers to the relatively permanent changes in behavior or knowledge that support long-term retention and transfer. On the other hand, Soderstrom and Bjork describe the contrasting concept of performance, which refers to “the temporary fluctuations in behavior or knowledge that can be observed and measured during or immediately after the acquisition process.” (2015, p. 176). This distinction needs to be secure within a teacher’s understanding if they are to apply the principles of effective assessment to their classroom. Soderstrom and Bjork acknowledge the fact that “considerable learning can occur in the absence of any performance gains and, conversely, that substantial changes in performance often fail to translate into corresponding changes in learning” (2015, p.176).

In my early career as a teacher, my understanding of assessment was more rudimentary with an end of unit test being used to measure either attainment or progress (or a muddled combination of the two) from one point to another with no real consideration of the outcomes or consequences thereof, or, naively assuming that performance in a lesson was a good proxy for learning. Wiliam noted that “for many years, the word “assessment” was used primarily to describe the processes of evaluating the effectiveness of sequences of instructional activities when the sequence was completed.” (2011, p.1)

Impact on my classroom practice

In their systematic review, Heitink et al. identify several prerequisites for successfully implementing assessment for learning in the classroom. Two broad categories struck me as being important. The first is teacher knowledge and skills: they identified that teachers needed “assessment literacy… that is, the knowledge and skills to collect, analyse and interpret evidence from assessment and adapt instruments accordingly.” (2016, p.4) They developed this further by identifying that “without understanding a concept or without knowing a common misconception related to a subject, teachers were not able to provide accurate and complete feedback.” (2016, p.4). This is particularly pertinent in the primary setting; the shift to a knowledge-rich curriculum has meant that primary school teachers are expected to be expert in many subject areas. They also noted that “because AfL takes place in everyday classroom practice, such as during discussions … teachers need the ability to interpret information about students learning on the spot.” (2016, p.5), which, when linked to the point above, can provide challenges, particularly for teachers early on in their career.

The second broad category is teacher belief and attitudes. Heitink et al. noted that “teachers’ beliefs, attitudes, perspectives and philosophy about teaching and learning influence the quality of AfL implementation.” (2016, p.5) Given that this is a highly subjective and personal area of teacher practice, the ramifications for the validity and reliability of the inferences being made about assessment data must be significant. Heitink et al. went on to note that “in addition, both Aschbacher and Alonzo (2006) and Birenbaum et al. (2011) found that the quality of AfL practice is influenced by the extent to which teachers feel responsible for student attainment of goals rather than just coverage of the curriculum.” (2016, p.5). Again, this is going to create a noisy measure when trying to weigh up the validity and reliability of different teachers’ inferences. Does this require a cultural shift in the way that schools think of assessment as a whole? Surely there are ethical considerations here to ensure all children get a fair and consistent offer, regardless of the teacher they have and that teacher’s personal thoughts towards assessment?

Over time, and as methods of assessment such as Assessment for Learning became more talked about, my repertoire of assessment methods fell in line with the four main interventions that Taras (2010, p.3015) reiterated: questioning, feedback through marking, peer- and self-assessment, and formative use of summative tests.

My classroom practice will be positively impacted by Wiliam’s ideas that “the purpose of assessing summatively is to create a shared meaning and the purpose of assessing formatively is to produce a consequence for the teacher and pupil” and it is the latter that I feel needs further exploration in light of the research.

One of the most persistent challenges within a classroom is knowing what children are thinking at different points within a lesson or sequence of lessons. More specifically, Harry Fletcher-Wood describes how “it is hard to know what students are thinking, so they may maintain errors and misconceptions through the lesson” (2018, p.75). This has been an issue that I have faced throughout my career, and it has undoubtedly resulted in worse pupil outcomes than might otherwise have been possible. After all, if I don’t know which children hold a misconception, how can I address it? As a result, the main thrust of this Move has been to mitigate these unknown factors as much as possible through implementing hinge questions which provide instructionally tractable information in order to adapt instruction for maximum impact.

Through the implementation of responsive and adaptive teaching, the circumstances for formative assessment in science within Year 5 could provide me with instructionally tractable information from a variety of assessment methods that can be immediately used in my instruction with the goal of advancing learning. The development of hinge questions within science will provide me with the opportunities to respond to the assessments I make in ways that I would not be able to without them being present. Furthermore, by making use of carefully considered hinge questions with distractors that identify misconceptions, I will move closer to understanding what misconceptions my children hold and will be able to put instruction or intervention in place to remedy these.

So how did this research translate into my classroom? Firstly, it is important to identify the ‘problem’ of practice.

The aspect of my professional practice that I am aiming to develop relates to making accurate assessments in science learning in my Year 5 class. More specifically, I want to identify the possible misconceptions children have with particular scientific concepts through the implementation of hinge questions. Ideas and constructions that children develop through informal learning and development can “provide a shaky foundation for new concepts.” (Allen, 2020, p.5) This is particularly pertinent within the primary setting given the huge disparity between Teacher Assessment in the subject and the outcomes of the Key Stage 2 Science Sampling tests (2019). The Teacher Assessment data for my current cohort shows that 60% of children met the Expected Standards, with 25% being judged as “Has Not Met” at the end of KS1. 15% of children did not receive an assessment as they were not at the school in Year 2. However, I am cautious of the reliability of this data, knowing the huge disparity between Teacher Assessment and SATs Sampling Data nationally. The cohort figure is at significant odds with both the average of 80% Teacher Assessment and the 22% standardised assessment outcomes, which presents an opportunity for further exploration, particularly as Teacher Assessment in English and maths is relatively close to summative assessment outcomes from SATs and has improved in accuracy since 2016, as the table below shows.

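                 2016          2017          2018          2019
                 TA    SATs    TA    SATs    TA    SATs    TA    SATs
Reading          80%   66%     79%   72%     80%   75%     -     73%
Writing/GPS      74%   73%     76%   77%     78%   78%     78%   78%
Maths            78%   70%     77%   75%     79%   76%     -     79%
Science          81%   22%     82%   -       82%   21%     83%   -

(TA = Teacher Assessment; a dash marks a figure that is not given.)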

Table 1: End of Key Stage 2 Outcome – Teacher Assessment vs SATs

The literature review Systematic divergence between teacher and test-based assessment (2021) noted that “when the levels from the teacher and the test did not match, teacher under-rating relative to the test results was slightly more common than over-rating in all analyses of science, but there was no clear tendency under- or over-rating in English and maths.”

This directly contradicts the findings of the National Reference test data, providing yet another example of how assessment in science is confused and unreliable.

Science is a core subject. However, in most primary schools, it is not given the same status as English and maths in terms of curriculum time or professional development for staff. Because of this, there are several key issues with science learning in primary schools in England. These have been recently reported by Bianchi (2020) as:

  1. Children’s science learning is superficial and lacks depth
  2. Children’s preconceptions aren’t adequately valued
  3. Children’s science learning lacks challenge
  4. Children are over reliant on teacher talk and direction, they lack autonomy and independence in science learning
  5. Children are engaged in prescriptive practical work that lacks purpose

The observations made by Bianchi, coupled with the contradictory narratives presented by the data for primary science, paint a picture of confusion in the planning, teaching and assessment of primary science in England.

For my practice, this picture presents challenges, particularly concerning the use of assessment data from short units of science, along with End of Unit tests, which are unreliable as they sample for a small percentage of the overall domain and are designed by the class teachers. Tests are only ever a proxy for what we want to measure. In essence, teachers can be biased towards selecting questions they know the children will get right. This unreliable data is then used to make an inference and attribute an end of Phase assessment grade to children’s science learning.

A hinge question is a multiple-choice question which provides an immediate check of students’ understanding. Crucially, a hinge question provides a check of understanding for every student in a class which provides the class teacher with data as to whether to proceed with the lesson or reteach and consolidate the lesson content.

There is evidence (Christodoulou, 2017; Wiliam, 2011 & 2014; Fletcher-Wood, 2018) suggesting that aspects of assessment for learning, formative assessment and responsive and adaptive teaching can have an impact on children’s learning, however, Professor Rob Coe indicated that “there has been no (or at best limited) effect on learning outcomes nationally” (2013, p.10), which Wiliam attributes to there being a “focus on what was deficient about the work… rather than on what to do to improve their future learning” (2011, p.120). The benefit of hinge questions, in this case, becoming a feature of each science lesson is that they will provide instructionally tractable data that will inform the direction of the lesson. This is immediately apparent and does not rely on waiting until after the lesson to discover what children could – and more importantly could not – understand. This shifts the focus from assessment of learning done summatively at the end of the lesson to responsive teaching done in the moment. This move could have significant implications on the outcomes of children’s learning in science and so provides an ethical basis for pursuing it further. It is unlikely that the implementation of hinge questions will damage pupils’ learning.

Although seemingly straightforward in their implementation, effective hinge questions need to be carefully considered in their design. They need to be closed questions that capture what every child is thinking, must be well structured so that a response can be gathered in approximately 30 seconds, and must be designed to elicit misconceptions so that they do not rely on self-reporting. Wiliam notes that “there’s nothing new in this idea, but it turns out it’s rather difficult to do.” (2015, p.40)

I am drawn back to Butler (2017) who suggests that the use of multiple-choice questions could be used to develop and improve the use of cognitive and metacognitive strategies. The specific design of each multiple-choice quiz can determine the outcome and the impact on future learning. Butler (2017) noted that “tests do more than just assess learning – they also cause learning.”

There is no doubt that I am making use of Assessment for Learning strategies and asking several pertinent questions throughout each lesson to check for understanding, however these are not well crafted enough to elicit specific enough information to highlight misconceptions within the class.

Because the methods being implemented form part of my day-to-day practice, I do not consider explicit parental consent to be necessary. Further, all data gathered because of my move will be stored and managed in line with already established school policies and will conform to data protection regulations.

There are several ways in which the impact of this ‘move’ could be evaluated.

Two of the main assessment types are norm-referenced and criterion-referenced tests. Norm-referenced tests compare a child’s performance against the performance of their peers, usually within a large sample (e.g., nationally). Criterion-referenced tests compare a pupil’s knowledge acquisition against a predetermined standard or set of criteria. In this style of test, the performance of others does not affect a child’s outcome.

Some tests can provide criterion-referenced and norm-referenced results. A child may have a high percentile rank, but not meet the criterion for proficiency. Questions must then be asked as to whether that child is doing well because they are outperforming their peers or doing poorly because they have not achieved proficiency.
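To make the distinction concrete, the short sketch below scores one invented class both ways. The scores, the pass mark of 30 and the function names are all hypothetical; they only illustrate how a child can rank highly against their peers and still fall short of a fixed criterion.

```python
def percentile_rank(score, cohort_scores):
    """Norm-referenced view: percentage of the cohort scoring strictly below `score`."""
    below = sum(1 for s in cohort_scores if s < score)
    return 100 * below / len(cohort_scores)


def meets_criterion(score, pass_mark):
    """Criterion-referenced view: compare against a fixed standard, ignoring peers."""
    return score >= pass_mark


# Hypothetical test scores out of 40 for a cohort of ten children.
cohort = [12, 15, 18, 20, 21, 22, 24, 25, 27, 28]
PASS_MARK = 30  # assumed proficiency threshold, not a real national standard

child_score = 28  # the highest score in this invented cohort
rank = percentile_rank(child_score, cohort)
proficient = meets_criterion(child_score, PASS_MARK)
print(f"Percentile rank: {rank:.0f}, meets criterion: {proficient}")
```

In this invented cohort the highest-scoring child outperforms nine of ten peers yet still does not meet the assumed criterion, which is exactly the situation that prompts the question above.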

At the end of Key Stage 2, SATs tests are the usual national measure of progress and attainment. However, children do not sit SATs tests in science. Nor has my setting been chosen to undertake the Science Reference Tests. As part of our usual assessment cycle, at the end of each Enterprise (the collective term for our wider curriculum of foundation subjects), children sit a criterion referenced test. I will utilise this established method to gather data and measure impact. Within this test, I will include questions linked directly to the scientific concepts covered with the content of lessons. Further to this, I will be utilising the expertise of my Science Subject Lead to conduct semi-structured interviews with a range of children to ascertain their feelings towards the impact of the hinge questions.

Given that the primary purpose of hinge questions is to provide that instructionally tractable data, I will create and embed at least one hinge question into each science lesson. Below is a visual representation of the fluid model I will consider during each lesson, knowing that each lesson could require a unique approach, depending on the factors evident in the lesson:

Reviewing this data quickly and efficiently is crucial. To do this effectively and record children’s responses over several lessons in order to build a more comprehensive picture of their abilities, I will use a digital app called Plickers. This will identify which children answered the hinge questions correctly, but more importantly, those who did not. It will give me an overall percentage and allow me to forensically analyse things like groupings. Further to this, it will provide me with which children selected which misconception, thus allowing me to support children in addressing this view in order to consolidate the correct knowledge.
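The kind of analysis described above can be sketched in a few lines. This is an illustration only: it does not use Plickers' own export format or API, and the pupil names, answer options and the 80% threshold for moving on are assumptions made for the example.

```python
from collections import defaultdict


def summarise_responses(responses, correct_option):
    """Return the percentage correct and the children grouped by wrong option chosen."""
    groups = defaultdict(list)
    for child, option in responses.items():
        if option != correct_option:
            groups[option].append(child)
    n_correct = len(responses) - sum(len(g) for g in groups.values())
    percent_correct = 100 * n_correct / len(responses)
    return percent_correct, dict(groups)


def next_step(percent_correct, threshold=80):
    """Decide whether to proceed; the threshold is an assumed value, not a rule."""
    return "move on" if percent_correct >= threshold else "reteach"


# Hypothetical responses to one hinge question whose correct option is "B".
responses = {
    "Child A": "B",
    "Child B": "B",
    "Child C": "C",
    "Child D": "B",
    "Child E": "D",
}

percent, by_distractor = summarise_responses(responses, correct_option="B")
print(percent, by_distractor, next_step(percent))
```

Because each distractor is written to reflect one misconception, the grouping by wrong option doubles as a list of which children need which piece of targeted support.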

Conversely, the hinge questions could be used to elicit scientific thinking from those children who could potentially achieve at a greater depth. For example, a child who chose the correct answer could verbalise their reasoning as to why they did not choose an alternative answer. This promotes both support and challenge, ensuring all children can reach their potential. This is an ethical consideration that has been present throughout my career but needs careful planning when implementing specific “Moves”.

I have considered the use of attitude scales, completed before and after the move has been implemented. It is my view that these would lack reliability, as children would potentially respond more positively given that I am both their classroom teacher and Deputy Headteacher. They are more likely to respond with an answer they perceive I would want. However, the same scales administered by my partner teacher could yield useful data for analysis about children’s perceptions of science lessons before and after hinge questions have been implemented.

So how do you take the theory of a seemingly effective assessment method into the action of a classroom setting? So far, I have implemented at least one hinge question into each science lesson over a period of a half term; this has ensured coverage of the National Curriculum content for Year 5 Earth and Space and, most recently, Forces and Motion. As a result of this, I have worked through the process of identifying key concepts that students must understand to move forward (Willingham 2002; Wiliam, 2015) and identified common misconceptions that arise around the key concept (Fletcher-Wood, 2018). Through some additional reading of Misconceptions in Primary Science by Michael Allen, in which he identifies the scientific concept, described as a “scientifically accurate explanation” (p. xxi), and the corresponding misconceptions, which he identifies as “scientifically incorrect statement expressed in words that a child might be likely to use” (p. xxi), I am confident that I have selected and integrated the right scientific concepts and have a solid understanding of the associated misconceptions.

From this strong starting point, I was able to use common misconceptions to construct multiple-choice questions that address those misconceptions within a lesson (Fletcher-Wood, 2018). Given that these questions needed to ensure all distractors were plausible and addressed misconceptions (Wiliam, 2015; Kirby 2014; Christodoulou, 2016), this was a bigger challenge than I had anticipated. My previous multiple-choice questions had not been set against these criteria, meaning that one answer was often obviously wrong and provided no useful information.

The key hinge-point within a lesson where the hinge question would occur (Wiliam, 2015; Fletcher-Wood, 2018) varied depending on the cognitive demand of the lesson. Some questions were asked earlier to check for understanding between key instructional inputs; others were asked before moving children onto their independent practice. This aligned fully with what I set out previously: a flexible model of implementation that meets the needs of the children based on the live class setting.

Using a digital platform, I was not only able to “obtain a response from every pupil” (Fletcher-Wood, 2018) but was instantly able to “use the data to determine whether to move on with the lesson or whether to go back” (Black & Wiliam, 1998; Wiliam, 2015). Further benefits to the digital platform were the aspects of gamification it provided alongside the data-rich reports provided to teachers after each question had been asked which include:

Time taken to answer: this provides a good insight into whether children are simply recalling information or whether effortful thought is required to determine the right answer. The time taken to answer ranged from 5 seconds to 18 seconds on average.

Correct and incorrect answers as a % of the total class: This was probably the most useful top-level information as this provides me with the data needed to determine whether to move forward with the lesson.

The answers given by individual children: Once I had determined whether to move forward with the lesson or not, this was the next most useful piece of data, as I could then tailor my intervention or support to the children whose answers highlighted a misconception.

Average score over time: Tracking children’s progress over time was a useful tool for identifying how consistent my children were during the learning sequences. This could then be used as another tool for measuring how reliable my inferences were about children’s wider understanding and learning in science. That is not to say that just because a child’s average score increases consistently, they are transferring more content to long-term memory.

Furthermore, the Plickers app had other benefits to children’s learning: gamification has become one of the most notable technological developments for human engagement. Majuri, Koivisto and Hamari (2018) noted that “it is not surprising that gamification has especially been addressed and implemented in the realm of education where supporting and retaining engagement is a constant challenge.” (p. 11)

However, I am drawn back to the research from Module 1 by Agarwal, Nunes and Blunt in which they recognised that “when students engage in retrieval practice, a common concern is that they are simply learning the test questions and answers.” (2021, p.39) and the research of Sumeracki and Weinstein, who noted the difference between learning and performance (2018), although the distinction was first set out by Bjork and Bjork (2011) and summarised in an integrative review by Soderstrom and Bjork (2015). In order to test whether my class were simply performing well during the lesson and had a high retrieval strength because the material was newly acquired, I used the same hinge questions as retrieval questions in later lessons, which showed mixed results.

Whilst in the Explore phase of this module, I conducted an Attitude to Learning questionnaire with all children. This asked two simple questions:

  1. Do you think you have made progress in your science learning since the beginning of year 5? Do you know more, can you remember more and can you do more?
  2. Do you enjoy your science lessons?


I repeated the same questionnaire at the end of Spring 1. The table below shows the comparable results:


| Question | Explore Phase: Net Agree | Explore Phase: Mean | End of Spring 1: Net Agree | End of Spring 1: Mean | Change in Net Agree |
|---|---|---|---|---|---|
| Do you think you have made progress in your science learning since the beginning of year 5? Do you know more, can you remember more and can you do more? | 65% | 3.7 | 70% | 3.8 | +5% |
| Do you enjoy your science lessons? | 81% | 4.2 | 90% | 4.6 | +9% |


(5 strongly agree; 1 strongly disagree)
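For readers who want to reproduce figures like the Net Agree and Mean columns from their own questionnaire, a minimal sketch follows. It assumes that "Net Agree" means the share of responses of 4 or 5 on the five-point scale; the exact definition used here is not stated, so treat that as an assumption. The responses are invented for illustration.

```python
from statistics import mean

def summarise(responses):
    """Return (net agree %, mean score) for a list of 1-5 Likert responses.

    Assumption: 'net agree' is the share of responses that are 4 or 5.
    """
    agree = sum(1 for r in responses if r >= 4)
    return 100 * agree / len(responses), mean(responses)

# Hypothetical responses from ten pupils, invented for illustration.
explore = [5, 4, 4, 3, 2, 4, 5, 3, 4, 4]
net_agree, average = summarise(explore)
print(f"Net agree: {net_agree:.0f}%, mean: {average:.1f}")
```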

The challenges I have faced with the implementation of this Move have stemmed from the design of the hinge questions themselves, particularly “ensuring all distractors are plausible and address misconceptions” (Wiliam, 2015; Kirby, 2014; Christodoulou, 2016). Ensuring that the questions are designed to elicit instructionally tractable information is an art form that requires honing over several iterations. My first attempts were not specific enough to identify the misconceptions children held. For example, my first question (shown below) had more than one correct answer. With practice, this could be a useable question. The lesson was focussed on the connection between the Sun, Earth and Moon, so I wanted to ascertain whether the children knew that the Moon orbits the Earth, and I should have kept this as the only correct answer.

Artefact 1:

Which is true about the Moon?

  1. It reflects light
  2. It orbits the Earth
  3. It can’t be seen during the day because there is too much light
  4. It has no gravity

Through making effective use of the work of Allen (2020), which is grounded particularly in primary science, I was able to adapt and develop the distractors to ensure that they are plausible and address the misconceptions held.

The following question was taken from a lesson on night and day:

Which of the following statements is true?

  1. The Sun moves across the sky
  2. Earth moves at an inconsistent rate and speed
  3. Earth spins on its axis once in every 24-hour day
  4. The Earth orbits the Sun once a day

My understanding has developed not only in terms of the benefits that hinge questions can bring to learning sequences, but also in terms of how to sequence instruction more effectively so that it is explicit enough to allow children to grasp the key objective of the lesson, particularly in the planning stages when identifying key concepts that students must understand to move forward (Willingham, 2002; Wiliam, 2015). My knowledge of what a particular misconception tells you about a child’s level of understanding has grown hugely – this highlighted a particular area of focus for my own subject knowledge development.

Throughout the Implementation Phase of my Move, I have adapted and tweaked both the hinge questions themselves – particularly the specificity of the question and the quality of the distractors – and the point during the lesson at which they are asked. I found that asking a hinge question too late into the instruction allowed too much time to pass before identifying misconceptions. Just because a hinge question is planned at a particular point in the lesson does not mean that I must wait until this point to address any misconceptions that arise. Furthermore, the model I identified worked in practice. Again, one must not consider it a binary model of either/or: one must remain firm to the principles of quality first teaching and effective hinge questions, but flexible in the approach to implementing them during the lesson. This dynamic response has ensured maximum impact on children’s learning. As a result, I will continue to monitor the implementation, but do not consider that any major changes to the fundamental concept are required.

To best articulate the implementation of my Move, the following artefacts are used in support of this assignment:

  1. Snapshot of Plickers data report (anonymised) showing an analysis of data points.
  2. Example hinge question used as a retrieval question in the following lesson, illustrative of the link between assessment and consolidation.

The positive impact of hinge questions seen in science led me to investigate whether assessing student performance and learning formatively using hinge questions could be easily transferred between subject disciplines, particularly in the primary setting, where the variables are limited because the class teacher is usually the same person delivering the learning sequences across multiple subject domains. Within this Move, I originally set out to explore whether the effective implementation of hinge questions in science would positively impact on children’s learning, but I have found that this assessment method can have similar effects in other subject areas too, namely English, particularly grammar, with a Year 6 greater depth booster group.

Whether assessing student learning using hinge questions in science or in grammar, the results showed varying degrees of success, but most shared one thing in common: the hinge questions revealed the underlying misconceptions children held about concepts – scientific or grammatical – which allowed me to better tailor my instruction to disabuse them of those misconceptions.

From these two examples, I can generalise that the implementation of hinge questions as a form of in-lesson assessment has a positive impact on children’s learning. I know that this is not the exclusive factor in this positive improvement but can confidently say that it has not had a negative impact. Results showed that a significant majority of children answered the hinge question correctly over the course of the block of learning, allowing me to make a broad inference that the instruction around key concepts was clear.

However, this was not always the case as the table below shows:

Table 2: Plickers report sheet

As a result of having this data at my fingertips, I was able to direct my support more strategically or, as Black and Wiliam (1998) note, use the data to determine whether to move on or go back. For example, in lesson one I directed support to students 11, 12, 25 and 30, who answered incorrectly. This immediate intervention provided these children with a short additional input to correct the error or misconception before continuing with the independent practice, which they would likely have got wrong without the intervention. Without this data, I would not have been able to do this until distance marking the children’s books at the end of the lesson, meaning they would not have received the corrective input until the next lesson, one week later.

The technology in this instance is supportive: it provides me with instant feedback on children’s performance. However, this information is only as good as the question used to elicit it. The technology used here should not be a limiting factor; teachers without access to Plickers could achieve a similar system through the use of whiteboards or another offline system, but consideration would need to be given to how to record and use the data effectively.

Moreover, what was of more interest was whether the children could recall the knowledge learned after the initial lesson took place. In essence, had sufficient knowledge been transferred to long term memory and, with some effortful thought, could this be retrieved to working memory later?

I analysed the results of end of unit science tests from the beginning of the academic year where hinge questions had not been utilised and compared them to the latest test where the instruction had made use of hinge questions.

Although I cannot say with certainty that hinge questions have been responsible for the progress seen from Autumn 1 to the current point in the academic year, they have not caused a decline in attainment, as indicated by the data set below:

Table 3: Summative test scores

Although it is too early to show whether children are getting better at science more generally, over the course of the academic year so far, the average raw score for each end of unit summative science test either increased or was maintained, with the number of children scoring in the band of Working Towards the Expected Standard decreasing between Autumn 1 and Spring 2. Further analysis of the test scripts saw that most children scored well on the multiple-choice questions and those directly related to knowledge recall, but some underperformed on the open questions where they were required to show their knowledge in support of or against a particular statement.

To examine the relationship between the implementation of hinge questions and an overall increase in scientific understanding, I conducted a content analysis which compared the percentage of children answering each hinge question correctly with the percentage answering the corresponding question correctly in the summative test:

Table 4 – Hinge question analysis vs summative test analysis (content analysis).


Some generalisations can be drawn from the data set above:

  • Where the initial percentage of children getting the hinge question correct was high (i.e. more children got the hinge question correct) this transferred across to the summative test with marginal gains.
  • Where the initial percentage of children getting the hinge question correct was low (i.e. more children got the hinge question incorrect) a higher percentage got the corresponding question correct on the summative test.

This could be due to several factors:

  1. The instruction given after the hinge question was answered incorrectly was effective enough in correcting the error or misconception that children did not repeat the same error in subsequent tests.
  2. The planning of subsequent retrieval questions was influenced by the data gained from the previous hinge question.
  3. Pupils identified their own areas of weakness and revised more effectively based on this knowledge.

It could be that a combination of all the factors identified above played a part in children’s summative test data being as it is. We can only make inferences about learning; but what is clear is that hinge questions did not negatively impact on children’s outcomes.
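The comparison behind Table 4 amounts to a simple subtraction per concept. The sketch below shows the arithmetic, assuming one percentage per concept for the hinge question and one for the matching summative question. The concept labels and percentages are invented for illustration and are not the data from Table 4.

```python
# Hypothetical percentages, invented for illustration (not the Table 4 data).
results = {
    "Moon orbits the Earth": {"hinge": 90, "summative": 93},
    "Earth spins on its axis": {"hinge": 55, "summative": 80},
}

def change_in_points(results):
    """Percentage-point change from hinge question to summative test, per concept."""
    return {
        concept: scores["summative"] - scores["hinge"]
        for concept, scores in results.items()
    }

for concept, change in change_in_points(results).items():
    print(f"{concept}: {change:+d} percentage points")
```

A larger gain where the hinge percentage started low is what one would expect to see if the corrective input was effective, although, as noted above, a comparison like this cannot separate that from the other possible factors.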

Throughout this Move, I have reflected on my teaching practice and have noticed further observable changes in the way I plan, teach and assess children’s learning: I am more explicit about the knowledge and concepts I teach and the sequence thereof, still paying particular attention to “identifying key facts, knowledge and concepts” (Pashler et al., 2007), and I continue to develop this skill in other subject areas, not just science. Further to this, I have paid particular attention to the distractors I use when devising hinge questions, ensuring that they provide me with useable data within the lesson which highlights children’s misconceptions. All distractors are plausible, and I avoid using choices that will provide me with no new information. This showed me that Active Ingredient D was pivotal in implementing this Move effectively.

Having reflected on the other Active Ingredients selected for this Move, I still consider them to be the most appropriate to support this Move. I feel that the Move would not have been as effective if one of them was removed.

Arguably one of the most important Active Ingredients of my Move was to use a common misconception to construct a multiple-choice question that addresses the misconception and key concepts within the lesson. As a result of this focus, I have had to ensure my own subject knowledge is developed enough to be able to respond to the misconceptions that arise. This has resulted in further reading and clarification of current best practice within science. I feel that this would apply to any teacher, but particularly those in the early stages of their career or those who are not subject specialists.

Active Ingredient F, “obtain a response from every pupil”, could be seen as a less crucial Ingredient. However, the ethical ramifications of this ingredient are significant. It could be easy to assume that not all children need to access a hinge question because of their prior knowledge or prior performance in science lessons. This assumption denies these children the opportunity to show their knowledge. Furthermore, the fact that a child performed well on one aspect of science should not be an indicator of their overall science performance. As teachers, we must not make assumptions.

Even with the implementation of hinge questioning as a further formative assessment method, I can still only make inferences about children’s learning. However, this diagnostic tool provides me with instant, instructionally tractable information, allowing me to make more reliable inferences about children’s learning. This, coupled with a greater understanding of what assessment is and is not, means that I can adapt my teaching practice to focus more on what will have a greater impact – that being effective instruction, forensic questioning and responsive teaching.



Agarwal, P. K., Nunes, L. D., & Blunt, J. R. (2021). Retrieval Practice Consistently Benefits Student Learning: A Systematic Review of Applied Research in Schools and Classrooms. Educational Psychology Review.

Allen, M. (2020). Misconceptions in primary science. Open University Press.

Bjork, E. L., & Bjork, R. A. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, 56-64.

Christodoulou, D. (2017). Making good progress?: The future of assessment for learning. Oxford University Press.

Coe, R., Rauch, C. J., Kime, S., & Singleton, D. (2020). Great teaching toolkit: evidence review.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington DC: American Council on Education.

Fletcher-Wood, H. (2018). Responsive teaching: cognitive science and formative assessment in practice. Routledge.

Heitink, M. C., Van der Kleij, F. M., Veldkamp, B. P., Schildkamp, K., & Kippers, W. B. (2016). A systematic review of prerequisites for implementing assessment for learning in classroom practice. Educational research review, 17, 50-62.

Majuri, J., Koivisto, J., & Hamari, J. (2018). Gamification of education and learning: A review of empirical literature. In Proceedings of the 2nd international GamiFIN conference, GamiFIN 2018. CEUR-WS.

Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An integrative review. Perspectives on Psychological Science, 10(2), 176-199.

Sumeracki, M. A., & Weinstein, Y. (2018). Optimising Learning Using Retrieval Practice. Impact.

Taras, M. (2010). Assessment for learning: assessing the theory and evidence. Procedia-Social and Behavioral Sciences, 2(2), 3015-3022.

Wiliam, D. (2011). Embedded formative assessment. Solution Tree Press.

Wiliam, D. (2011). What is assessment for learning?. Studies in educational evaluation, 37(1), 3-14.

Wiliam, D. (2015). Designing Great Hinge questions. Educational Leadership, 73(1), 40-44.

Wiliam, D., & Black, P. (1996). Meanings and consequences: a basis for distinguishing formative and summative functions of assessment?. British educational research journal, 22(5), 537-548.

Wiliam, D. (2014). Principled assessment design. Redesigning Schooling-8. London: SSAT (The School Network) Ltd, 2-97.



Appendix 1: Active Ingredients & Move Plan

1a) Active Ingredients

A Identify key concepts that students must understand to move forward (Willingham 2002; Wiliam, 2015)
B Identify common misconceptions that arise around the key concept (Fletcher-Wood, 2018)
C Use common misconceptions to construct a multiple-choice question that addresses the misconception and key concepts within a lesson (Fletcher-Wood, 2018)
D Ensure all distractors are plausible and address misconceptions (Wiliam, 2015; Kirby, 2014; Christodoulou, 2016)
E Identify a key hinge-point within a lesson where the hinge question will occur (Wiliam, 2015; Fletcher-Wood, 2018)
F Obtain a response from every pupil (Fletcher-Wood, 2018)
G Use the data to determine whether to move on with the lesson or whether to go back (Black & Wiliam, 1998; Wiliam, 2015)


1b) Move Plan

  1. Before the lesson, identify common misconceptions that may arise and identify the key hinge-point within the lesson.
  2. Create a multiple-choice question which incorporates these misconceptions and offers plausible distractors.
  3. Script the discussion that will be used in Step 6.
  4. Introduce the hinge-question to the class and explain how the students should respond. If you think the answer is A, hold up your Plickers card with the A at the top. If you think the answer is B, hold up your Plickers card with the B at the top. If you think the answer is C, hold up your Plickers card with the C at the top.
  5. Scan the classroom with the iPad looking at every child’s response.
  6. Use the script from Step 3 to discuss the answer to the hinge question.

Appendix 2: Artefacts

2a) Plickers report

2b) Do Now Front Sheet detailing hinge question used for retrieval