A Picture is Worth

In its July 1, 2017, issue, The Economist discusses a fascinating new process in which computers generate realistic videos of fictional events.  The specific example discussed in the article is an effort by a German artist, Mario Klingemann.  He began with an audio recording of a dialogue between Kellyanne Conway (President Trump's advisor) and Chuck Todd (an NBC television journalist).  He then combined that audio track with a black-and-white television clip of a music performance by Françoise Hardy, a French pop singer who rose to prominence in the 1960s.  The result is that it looks like Miss Hardy is speaking with Mrs. Conway's voice.

The clip is quite pixelated and jumpy, with sudden changes in lighting and coloring that look abnormal.  You might attribute these irregularities to the video having been recorded on relatively primitive technology fifty years ago — but even so, something about it looks off.  The clip simply does not have the usual indicia of reliability that we expect from video recordings.  And as a result, you suspect at a glance that it might be a fake.

As The Economist notes, it is already possible to generate 'fake video' far more realistic than Mr. Klingemann's example — think, for example, of the amazing computer-generated imagery and special effects common in Hollywood blockbusters.  However, that kind of persuasive realism requires hundreds of hours of expert technical effort and very powerful computers.  Such fakery remains expensive, and it is still essentially the product of human artistry.

What is so interesting about Mr. Klingemann's clip is that it was created by a computer with barely any human input, using a new technique: generative adversarial networks (or "GANs").

GANs were introduced in 2014 by Ian Goodfellow, who observed that, although deep learning allowed machines to discriminate marvelously well between different sorts of data (a picture of a cat vs. one of a dog, say), software that tried to generate pictures of dogs or cats was nowhere near as good.  It was hard for a computer to work through a large number of training images in a database and then create a meaningful new picture from them.
Mr. Goodfellow turned to a familiar concept: competition.  Instead of asking the software to generate something useful in a vacuum, he gave it another piece of software — an adversary — to push against.  The adversary would look at the generated images and judge whether they were “real,” meaning similar to those that already existed in the generative software’s training database.  By trying to fool the adversary, the generative software would learn to create images that look real but are not.  The adversarial software, knowing what the real world looked like, would provide meaning and boundaries for its generative kin.
Today, GANs can produce small, postage-stamp-sized images of birds from a sentence of instruction. Tell the GAN that "this bird is white with some black on its head and wings, and has a long orange beak," and it will draw that for you.  It is not perfect, but at a glance the machine's imaginings pass as real.
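For readers curious about the mechanics, the adversarial game described above can be sketched in a few dozen lines.  This is emphatically not Mr. Goodfellow's implementation, and real GANs learn from images with deep networks; the toy below shrinks everything to one dimension — the generator produces numbers, the "real" data are samples from a Gaussian, and both players are single-parameter-pair models updated by hand-derived gradients.  All names and hyperparameters here are illustrative choices, not anything from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data: samples from a Gaussian centered at 4.0.
def sample_real(n):
    return rng.normal(4.0, 0.5, n)

# Generator G(z) = a*z + b, starting far from the real distribution.
a, b = 1.0, 0.0
# Discriminator D(x) = sigmoid(w*x + c), judging real vs. generated.
w, c = 0.0, 0.0

lr, batch, decay = 0.05, 64, 0.1

for step in range(2000):
    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    x_real = sample_real(batch)
    z = rng.normal(0.0, 1.0, batch)
    x_fake = a * z + b
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    # Cross-entropy gradients; a little weight decay damps the
    # oscillation this two-player game is otherwise prone to.
    w -= lr * (np.mean(-(1 - d_real) * x_real + d_fake * x_fake) + decay * w)
    c -= lr * (np.mean(-(1 - d_real) + d_fake) + decay * c)

    # Generator update: try to fool D (non-saturating loss -log D(G(z))).
    z = rng.normal(0.0, 1.0, batch)
    x_fake = a * z + b
    d_fake = sigmoid(w * x_fake + c)
    g = -(1 - d_fake) * w          # d(loss)/d(x_fake); chain rule below
    a -= lr * np.mean(g * z)
    b -= lr * np.mean(g)

# After training, generated samples should cluster near the real mean (4.0),
# even though the generator never sees the real data directly — only the
# discriminator's verdicts.
gen_mean = float(np.mean(a * rng.normal(0.0, 1.0, 2000) + b))
print(gen_mean)
```

The point of the sketch is the division of labor the article describes: the generator never touches the training data; it improves only by probing what the discriminator will and won't accept.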

Of course, there's nothing new about fake photos.  Images have been manipulated from the very beginning of photography, though in general the time and trouble it required, combined with the difficulty of achieving a truly convincing fake, preserved photography's authority to some extent.  Sure, you might be able to remove a political enemy from your photo... but how realistic will it look?  As a result, we tended to trust the accuracy of photos.

But now it is becoming much easier to achieve convincing fakes — and not just in doctoring pre-existing photos or videos, but even in creating them from whole cloth.  While we are not yet at a point where computers can do that quickly and cheaply, the consensus is that we are not many years away.  It is interesting to consider how our collective attitudes towards the authority and reliability of video evidence will change once fakery is easily accomplished.  

In a certain sense, this is reminiscent of the decline in the printed word's authority that has accompanied the rise of modern computing.  Typesetting was sufficiently cumbersome and difficult that it generally wasn't undertaken without a lot of forethought — research, training, fact-checking, etc.  That generated an expectation of reliability.  Simply being in print was obviously no guarantee of accuracy (especially in heated partisan matters), but there was something like a rebuttable presumption of authority.

But now the ease with which we can create text (and 'publish' it on the Internet) has undermined that assumption.  Consider how easily Wikipedia entries are created and edited — the basis for their authority is completely different from that of their print forerunners.  Textual authority once derived, at least in part, from the significant 'barriers to entry' of typesetting.  But Wikipedia derives its (admittedly modest) authority precisely from the fact that there are almost no 'barriers to entry' for its editors.  Anyone who sees an error on Wikipedia can correct it, and with a large enough pool of editors, we assume the errors have been caught and the text is accurate.  Whether or not crowd-sourced fact-checking ultimately suffices, it is a head-spinning 180º from the model we had for hundreds of years.

In a similar fashion, computer-generated video may return us to a pre-photographic attitude towards pictorial representation.  In the centuries before photography, we depended upon artists to depict people, places and events for us — and we processed those depictions mindful of confounding factors like the artist's purposes, reputation, skill, etc.

Consider this drawing (right) from the Library of Congress; it was drawn by someone named Robert Griffin around 1856, and it depicts a meeting of the Senate of Liberia.  In a certain sense, it fulfills the same function as a photo of the same moment would fulfill — it tells us who was present, where they were standing, what the place generally looked like.  Of course, the depictions are less realistic than we would expect from a photo; it would be difficult to recognize one of these people on the street if we only had this drawing to go by.  But in many respects it resembles what a photo of this event would look like.

But this treats pre-photographic pictorial representations as ur-photographs — as if this drawing were an imperfect, undeveloped, primitive photograph.  This attitude is obviously wrong.  It forgets many principles we relied upon before photographs became prevalent.  As a photo, this would depict a specific moment in time — but should we assume that of the drawing?  Perhaps Mr. Griffin included people present at different times during the Senate's session?  Or consider a detail like the Senator shown with a raised hand.  Is Mr. Griffin claiming that no one else had a hand raised at the same time?  Was that even part of his considerations when he executed the drawing?  The drawing's perspective is that of a spectator in a raised gallery.  Does such a gallery exist?  Or did Mr. Griffin just imagine it?

Thanks to the coming rise of realistic, easily produced computer-generated images, we will have to re-learn to ask these questions.  They will have to become intuitive again, asked as a matter of course.  How long will it take society to incorporate these pre-photographic questions into our post-photographic world?  For the past 100 years we have expected that photos and videos are pretty fair representations of reality, unless someone has gone to an awful lot of trouble to fake them.  That presumption of reliability and credibility gave audio and video a very powerful sway over us.  What will it be like, this world in which audio and video have as little reliability as something typed on a random webpage?  The past 100 years will seem like a brief dream, a felicitous time when we could trust what we saw.  Will we miss it?

If you wish to discuss this post with me, I'd welcome receiving an email from you.  Please email me at language.on.holiday@gmail.com.

Language on Holiday