As I predicted, the next big thing after DALL-E and MidJourney and Stable Diffusion is a synthetic video engine, Sora. And yes, it is good. By which I mean incredibly detailed in its rendering of image elements. By which I mean extremely accurate in predicting how elements move from frame to frame. By which I mean highly realistic, depending on where you focus your gaze (hands, feet and distant figures are still a work in progress).
We don’t yet have the model on open release, though like ChatGPT I imagine Sora will be pushed out to users as soon as basic red-team challenges have been passed, without addressing any real issues of robustness, safety, copyright infringement, bias, labour rights, ecological sustainability or systemic harms. A recent brief from Data & Society, by the way, notes that red-teaming is not appropriate for ‘assessing nuanced sociotechnical vulnerabilities’, particularly when ‘the process and system are closed to outsiders’. Sora’s development has been even more closed than GPT-4’s, but if that experience has taught tech anything, it is to build the user base first, and deal with the vulnerabilities later.
We do know some things. Just as large language models are trained on ‘word embeddings’, video models are trained on ‘spacetime patches’ drawn from vast amounts of (as yet undisclosed) video and gaming world data. This makes them highly believable when we attend to the parts, but fragile and unreliable when it comes to the bigger picture. Sora has no underlying model of physics (how things behave in the real space-time world), and no understanding of cause and effect (why stories unfold as they do). It builds each frame by approximating from the last, which is itself a thing of patches, matched to textual cues.
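(A quick aside for the technically curious: a ‘spacetime patch’ is just a small block of video, a few frames deep and a few pixels square, flattened into a vector, roughly the way a stretch of text is turned into a token. OpenAI has not published Sora’s code or training details, so the sketch below is purely illustrative: the clip size, patch size and numpy recipe are my assumptions, not Sora’s.)

```python
# Illustrative only: how a video clip might be cut into "spacetime patches".
# Sora's real patch sizes and pipeline are not public; these numbers are made up.
import numpy as np

frames, height, width, channels = 16, 64, 64, 3          # a tiny 16-frame RGB clip
video = np.random.rand(frames, height, width, channels)  # stand-in for real footage

pt, ph, pw = 4, 8, 8  # patch extent in time, height and width

# Carve the clip into non-overlapping blocks of pt frames by ph x pw pixels,
# then flatten each block into one vector - the video analogue of a text token.
patches = (
    video.reshape(frames // pt, pt, height // ph, ph, width // pw, pw, channels)
         .transpose(0, 2, 4, 1, 3, 5, 6)
         .reshape(-1, pt * ph * pw * channels)
)
print(patches.shape)  # (256, 768): 256 patches, each a vector of 768 numbers
```

A model trained on billions of such patches learns which tend to follow which, which is exactly why it can dazzle at local texture and motion while having no grasp of the physics or causality holding the whole scene together.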
The size of video files as compared with images or text gives some sense of the world-heating quantities of data and computation required to produce these glossy, glitchy memes. A hidden army of annotators and adjudicators will no doubt be needed to dial down the politically troubling and pornographic outputs, and dial up the cute animals. No doubt, too, we will be treated to more nonsense about how synthetic video shows an ‘emergent’ understanding of the world, just as we were for text models. Because pet videos and promo-porn don’t provide the kind of noble purpose that could justify the human and natural resources being thrown at the enterprise.
In fact, computer scientists who try to build general models of the world, like François Chollet, AI engineer at Google, know that Sora doesn’t get anywhere close to solving those problems.
There is, however, a self-fulfilling aspect to these synthetic models. Because they are not models of the real world. They are trained on, then integrated back into, the world of digital content. We know their data is biased in ways that are tied to specific injustices, and harmful to real people. But digital media is also biased in a more general sense: towards what works online. Synthetic text is built from the kind of algorithmically-optimised text that gratifies online readers. And as LLM outputs make up more of the online textual world, and reshape search algorithms, human-generated text may be less favoured by users and consequently harder to find.
The same goes for images. There is evidence that viewers find AI-generated faces more trustworthy and ‘human’ than images of real people, for example (though there is a racial dimension to this finding). Being online means interacting with avatars, bots, and images that have been enhanced in subtle and less subtle ways, which is why it can be so hard for young people to negotiate their relationships with real others and their own unenhanced selves. AI promises more, far more, of the same.
When it comes to video, it seems likely that Sora was trained on data from video games and game engines as well as live action films and social media clips. But supposedly ‘real’ video is also shaped by CGI values and game world aesthetics and post-production techniques. Video content doesn’t accurately reflect the real world any more than online faces do: it reflects an optimised and perfected, or hyper-real and exaggerated one. (It’s worth remembering that the current generative AI surge was made technically possible by demands for greater realism in video games, which drove the development of ever-more advanced parallel-processing GPUs, created the chip-making monster that is NVIDIA, and enabled its pivot to AI data-munching in the mid-2010s.)
Synthetic media are fabrications of digital content, not representations of the ‘real world’. They are built from whatever biases, profit motives and user compulsions drive the production of that content. Their business is to enclose more and more of it, and to give users fewer and fewer alternatives or exit points. They don’t have to get better at modelling some ‘other’ reality, only make their self-contained reality more compelling, and persuade users to spend more time there.
What could possibly go wrong?
At the interface with the real world, however, political actors are heavily invested in video synthesis. A recent report found that all the major political parties in India have AI teams to produce and anonymously circulate deepfake video. Other states are just less open about it. Sora arrives into the most consequential election year imaginable, when how a few powerful men project their personas onto our screens will determine the future of the planet. It arrives into several war zones that, for most people, can only be glimpsed through the mobile phone cameras of actors on the ground.
It is not that Sora has a unique capacity for on-screen fakery. It’s just that its promise of accelerated fakes will accelerate the decline of belief in any shared visual reality. A recent study of public discourse about the Russia-Ukraine war, for example, found that distrust of authentic video material was far more prevalent than a belief in fakes. Worse:
efforts to raise awareness around deepfakes may undermine trust in legitimate videos. Consequentially, news media and governmental agencies need to weigh the benefits of educational deepfakes and pre-bunking against the risks of undermining truth.
I am not sure ‘truth’ is quite so here-and-gone as this suggests. Truth has always belonged more securely to the people with the best technologies, and the power to broadcast their versions. But synthetic and fake media further undermine the (already diminished) democratic possibilities of networks – the ways in which truth-telling has, for a while, been more available to people with fewer technical means.
In education as in news. Gary Marcus points out that a proliferation of fun-to-watch but inaccurate videos that violate the laws of physics, or animal anatomy, is going to present a challenge to open education and to young people’s habits of learning from video.
The ants’ nest meme - let’s recall that it was chosen to demonstrate the power of Sora - features ants that are both anatomically and behaviourally wrong. Someone, somewhere, will say that there is educational value here. Spot the errors! Endless learning! There will certainly be people who spend precious intellectual resources tracking these errors as they proliferate online. But the people doing that will already have a secure foundation in their specialist corner of the world, whether that is physics or history, the streets of Tokyo or insect behaviour. ‘What the AI got wrong’ should be, at best, a fun end-of-term quiz: it can’t be the actual syllabus.
The same pressures of time and productivity that are pushing academics towards text generation may also make synthetic video attractive. Pre-Sora - and how I wish that did not sound nostalgic already - some educators were using apps like DeepReel, D-iD and Runway to generate video/voice avatars of themselves, and plugging in transcripts to produce instant ‘lectures’, complete with realistic expressions and hesitations. I’m pretty sure these were ‘what if?’ exercises rather than actual teaching resources. But ‘what if?’ is coming up faster all the time. What if you could plug in ChatGPT to a video avatar to generate new content literally on cue? Or use a celebrity avatar (I don’t know, Taylor Swift perhaps) to deliver your lecture? Try not to imagine the ways such an arrangement could be used to abuse lecturers, scam students, and exploit anyone with a voice and likeness online. Just think of the bums on seats that could be served. Deepfake pedagogy, anyone?
Reasons to be hopeful
And yet…
The pushback that I saw online against these prototype ‘video avatar teachers’ tells me something about the teaching power of video. Video and image fakes produce a visceral reaction. We may be used to words offering ‘alternative truths’, at least from strangers, but we feel panicky and disoriented if we can’t believe the evidence of our ears and eyes. This is survival stuff. I already find synthetic images useful for illustrating some of the problems with text generation, and I think video is going to be even more valuable. I’m not saying the different media models are exactly the same. The text-to-text-to-more-text prompt experience is unique to language models, and is why they are becoming the backbone of all the compelling new interfaces - including with video and voice avatars - that are really the main business proposition now. But image and video illustrate vividly the problems of bias in all the underlying data. They can also show up (as in the case of Gemini’s failed guardrails) the crudeness and dishonesty of post-hoc attempts to engineer the bias away.
Also - I’m not sure why - it seems easy for people to make the link between synthetic ‘art’ and all the actors, film makers and artists who will no longer be able to make a living. The ‘wow’ response is followed almost immediately by questions about the future of visual culture. Perhaps it’s harder to appreciate just how crap, how deadly dull, how utterly bereft of humanity all those AI-produced books and poems and narratives are, because you can just avoid reading them. Or perhaps text workers are less glamorously ‘creative’ than artists. I don’t care. I’ll take the effect and I’ll use it to ask questions about how text work is being devalued too, and how that might diminish the rewards of working in a whole host of professions that are not obviously ‘creative’.
In education, the limitations of synthetic images make a great way in to talking about the limitations of generating text. But I think the critical frame has to open out beyond the details of ‘what the AI gets wrong’. Detail - word parts, pixels, patches - is what synthesis works with and where it impresses most. Learning is not the accumulation of detail. It is constructing a domain of knowledge in its orderliness, its known disorderliness, its core concepts, its particular theories and methods and values. Also its limitations and structural biases. Even if knowledge is flexible and contingent, it is, in an important way, more concise than the world it refers to. It is generative (truly generative) of new responses in new situations. It has levels of coherence that are more than just the sum of local correlations. It is personal: it becomes part of the self seeing the world, not just the world being seen. This is why we can work out some concise (if contingent) rules for making our way in the world when we are infants, and don’t then have to boil our heads with data every time we go to sit on a chair.
Only five minutes ago, educators were being urged to get around student use of synthetic text by setting more ‘innovative’ assignments, such as videos and presentations. Some of us pointed out that this would work for about five minutes, and here we are. The medium is not the assignment. The assignment is the work of its production. This is already enshrined in many practices of university assessment, such as authentic assessment (a resource from Heriot-Watt University), assessment for learning (a handy table from Queen Mary University of London) and assessing the process of writing (often from teaching English as a second language, e.g. this summary from the British Council). The generative AI surge has prompted a further shift towards these methods: I’ve found some great resources recently at the University of Melbourne and Monash University.
But all these approaches require investment in teachers. Attending to students as meaning-making people, negotiating authentic assessments, giving feedback on process, and welcoming diversity: these are very difficult to ‘scale’. And in all but a few universities, funding per student is diminishing. So instead there is standardisation, and data-based methods to support standardisation, and this has turned assessment into a process that can easily be gamed. If the pressures on students to auto-produce assignments are matched by pressures on staff to auto-detect and auto-grade them, we might as well just have student generative technologies talk directly to institutional ones, and open a channel from student bank accounts directly into the accounts of big tech while universities extract a percentage for accreditation.
The bigger picture is not a bigger context window for generative technologies; it is connecting ideas about the world with the world itself, and testing them using disciplinary methods. Against the tide of technical fixes, attention could be directed instead to how particular communities construct knowledge (contingently, fallibly), and how learners adopt those practices themselves, developing a personal repertoire, and feeling safe enough to get things wrong. If universities can offer this pedagogic opportunity and challenge, I think they can answer young people’s increasingly anxious questions about why they are there.
Some excellent points, as ever. Amongst many things you have raised here, Taylor Swift giving a lecture reminded me of a university many years ago discussing the possibility of employing actors to give lectures to improve module feedback evaluations. #WatchThisSpace
Excellent meditation, Helen.
Question about "it’s harder to appreciate just how crap, how deadly dull, how utterly bereft of humanity all those AI-produced books and poems and narratives are, because you can just avoid reading them" - are you seeing any signs of human creators labeling their work AI-free?