Simplicity is the ultimate sophistication

Video Composition With iOS


Why video composition

You may think that video composition should be limited to applications like iMovie or Vimeo, and so consider this subject, at least from the developer's point of view, confined to a niche of video experts. Instead it can be extended to a broader range of applications, not necessarily limited to practical video editing. In this post I will provide an overview of the AV Foundation framework applied to a practical example.

In my particular case the challenge was to build an application that, starting from a set of existing video clips, was able to build a story by attaching a subset of these clips based on decisions taken by the user while interacting with the app. The final play is a set of scenes, shot in different locations, that compose a story. Each scene consists of a prologue, a conclusion (epilogue) and a set of smaller clips that will be played by the app based on user choices. If the choices are correct, the user will be able to play the whole scene up to its happy end, but in case of mistakes the user will return to the initial prologue scene or to some intermediate scene. The diagram below shows a possible scheme of a typical scene: one prologue, a winning stream (green), a few branches (yellow are intermediate, red are losing branches) and a happy end. So somewhere in TRACK1 the user will be challenged to take a decision; if he/she is right the game will continue with TRACK2, otherwise it will enter the yellow TRACK4, and so on.

What I have in my hands is the full set of tracks, each track representing a specific subsection of a scene, and a storyboard which gives me the rules to follow in order to build the final story. So the storyboard is made of the scenes, of the tracks that compose each scene and of the rules that establish the flow through these tracks. The main challenge for the developer is to put together these clips and play a specific video based on the current state of the storyboard, then advance to the next, select a new clip again and so on: everything should be smooth and interruptions limited. Besides, the user needs to make decisions by interacting with the app, and this can be done by overlaying the movie with some custom controls.

The AV Foundation Framework

Trying to reach the objectives explained in the previous paragraph using the standard Media Player framework view controllers, MPMoviePlayerController and MPMoviePlayerViewController, would be impossible. These controllers are good for playing a movie with the system controls, full-screen and device rotation support, but absolutely not for advanced control. Since the release of the iPhone 3GS the camera utility has had some trimming and export capabilities, but these capabilities were not exposed to developers through public SDK functions. With the introduction of iOS 4, the work Apple did on the iMovie app has given developers a rich set of classes that allow full video manipulation. All these classes have been collected and exported in a single public framework, called AV Foundation. This framework has existed since iOS 2.2; at that time it was dedicated to audio management with the well known AVAudioPlayer class, then it was extended in iOS 3 with the AVAudioRecorder and AVAudioSession classes, but the full set of features that enable advanced video capabilities arrived only with iOS 4 and was fully presented at WWDC 2010.

The position of AV Foundation in the iOS frameworks stack is just below UIKit, behind the application layer, and immediately above the basic Core Services frameworks, in particular Core Media, which AV Foundation uses to import the basic timing structures and functions needed for media management. In any case, note the different position in the stack compared with the very high-level Media Player framework. This means that AV Foundation cannot offer a plug-and-play class for simple video playing, but you will appreciate the high-level and modern concepts behind this framework; for sure we are not at the same level as older frameworks such as Core Audio.


(image source: Apple iOS Developer Library)

Building blocks

The class organization of AV Foundation is quite intuitive. The starting point and main building block is AVAsset. AVAsset represents a static media object and it is essentially an aggregate of tracks, which are timed representations of a part of the media. Each track is of a uniform type, so we can have audio tracks, video tracks, subtitle tracks, and a complex asset can be made of multiple tracks of the same type, e.g. multiple audio tracks. In most cases an asset is made of one audio and one video track. Note that AVAsset is an abstract class, so it is unrelated to the physical representation of the media it represents; besides, creating an AVAsset instance doesn't mean that the whole media is ready to be played: it is a purely abstract object.
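As an illustration of how tracks can be inspected, a minimal sketch (assuming an already created asset named asset; note that on iOS these track arrays are fully reliable only once the "tracks" key has been loaded, see the asynchronous loading discussion below) could look like this:

```objc
#import <AVFoundation/AVFoundation.h>

// Group the asset tracks by media type.
NSArray *videoTracks = [asset tracksWithMediaType:AVMediaTypeVideo];
NSArray *audioTracks = [asset tracksWithMediaType:AVMediaTypeAudio];
for (AVAssetTrack *track in videoTracks) {
    // each track carries its own time range inside the asset
    NSLog(@"video track %d, duration: %f s",
          (int)track.trackID, CMTimeGetSeconds(track.timeRange.duration));
}
NSLog(@"%d video and %d audio tracks",
      (int)[videoTracks count], (int)[audioTracks count]);
```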

There are two concrete asset classes available: AVURLAsset, to represent a media in a local file or on the network, and AVComposition (together with its mutable variant AVMutableComposition) for an asset composed of multiple media. To create an asset from a file we need to provide its file URL:

NSDictionary *optionsDictionary = [NSDictionary dictionaryWithObject:[NSNumber numberWithBool:YES] forKey:AVURLAssetPreferPreciseDurationAndTimingKey];
AVURLAsset *myAsset = [AVURLAsset URLAssetWithURL:assetURL options:optionsDictionary];
The options dictionary can be nil, but for our purposes, that is making a movie composition, we need the duration to be calculated exactly and random access to the media. This extra option, setting the AVURLAssetPreferPreciseDurationAndTimingKey key to YES, may require extra time during asset initialization, depending on the movie format. If the movie is in QuickTime or MPEG-4 format, the file contains additional summary information that avoids this extra parsing time; but there are other formats, like MP3, where this information can be extracted only after decoding the media file, in which case the initialization time is not negligible. This is a first recommendation for developers: use the right file format for your application.
In our application we already know the characteristics of the movies we are using, but in a different kind of application, where you must do some editing on user-imported movies, you may be interested in inspecting the asset properties. In that case we must remember the basic rule that initializing an asset doesn't mean the whole asset is loaded and decoded in memory: every property of the media file can be inspected, but this may require some extra time. For completeness we simply introduce the way asset inspection can be done, referring the interested reader to the documentation (see the suggested readings list at the end of this post). Basically each asset property can be inspected using an asynchronous protocol called AVAsynchronousKeyValueLoading, which defines two methods:

- (AVKeyValueStatus)statusOfValueForKey:(NSString *)key error:(NSError **)outError
- (void)loadValuesAsynchronouslyForKeys:(NSArray *)keys completionHandler:(void (^)(void))handler

The first method is synchronous and immediately returns the knowledge status of the specified value. E.g. you can ask for the status of "duration" and the method will return one of these possible statuses: loaded, loading, failed, unknown, cancelled. In the first case the value is known and can be immediately retrieved. In case the value is unknown, it is appropriate to call the loadValuesAsynchronouslyForKeys:completionHandler: method, which at the end of the operation will call the callback given in the completionHandler block, which in turn will query the status again and take the appropriate action.
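A hedged sketch of this pattern applied to the "duration" key might look like the following (the error handling is only hinted at, and remember that the completion handler is not guaranteed to run on the main thread, so dispatch to the main queue before touching the UI):

```objc
#import <AVFoundation/AVFoundation.h>

NSArray *keys = [NSArray arrayWithObject:@"duration"];
[myAsset loadValuesAsynchronouslyForKeys:keys completionHandler:^{
    NSError *error = nil;
    AVKeyValueStatus status = [myAsset statusOfValueForKey:@"duration" error:&error];
    switch (status) {
        case AVKeyValueStatusLoaded:
            // the value is now safe to read
            NSLog(@"duration: %f s", CMTimeGetSeconds([myAsset duration]));
            break;
        case AVKeyValueStatusFailed:
            // inspect error and report the failure
            break;
        default:
            // loading, cancelled or unknown: wait or retry as appropriate
            break;
    }
}];
```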

Video composition

As I said at the beginning, my storyboard is made of a set of scenes and each scene is composed of several clips whose playing order is not known a priori. Each scene behaves separately from the others, so we'll create a composition for each scene. When we take a set of assets, or tracks, and build a composition from them, all in all we are creating another asset. This is the reason why the AVComposition and AVMutableComposition classes are in fact subclasses of the base AVAsset class.
You can add media content inside a mutable composition by simply selecting a segment of an asset, and adding it to a specific range of the new composition:

- (BOOL)insertTimeRange:(CMTimeRange)timeRange ofAsset:(AVAsset *)asset atTime:(CMTime)startTime error:(NSError **)outError

In our example we have a set of tracks and we want to add them one after the other in order to generate a continuous set of clips. So the code can be simply written in this way:

AVMutableComposition *composition = [AVMutableComposition composition];
CMTime current = kCMTimeZero;
NSError *compositionError = nil;
for (AVAsset *asset in listOfMovies) {
    BOOL result = [composition insertTimeRange:CMTimeRangeMake(kCMTimeZero, [asset duration])
                                       ofAsset:asset
                                        atTime:current
                                         error:&compositionError];
    if (!result) {
        if (compositionError) {
            // manage the composition error case
        }
    } else {
        current = CMTimeAdd(current, [asset duration]);
    }
}
First of all we introduced the concept of time. Note that media time differs from ordinary clock time: it can move back and forth, and its rate can be higher or lower than 1x when the movie is played in slow motion or fast forward. Besides, it is considered more convenient to represent time not as a floating point or integer number but as a rational number. For this reason the Core Media framework provides the CMTime structure and a set of functions and macros that simplify the manipulation of these structures. So in order to build a specific time instance we do:
CMTime myTime = CMTimeMake(value,timescale);
which in fact specifies a number of seconds given by value/timescale. The main reason for this choice is that movies are made of frames, and frames are paced at a fixed rate per second. So for example if we have a clip shot at 25 fps, it is convenient to represent the single frame interval as a CMTime with value=1 and timescale=25, corresponding to 1/25th of a second; 1 second will be given by a CMTime with value=25 and timescale=25, and so on (of course you can still work with plain seconds if you like, simply use the CMTimeMakeWithSeconds(seconds, preferredTimescale) function). So in the code above we initially set the current time to 0 seconds (kCMTimeZero), then iterate on all of our movies, which are the assets in listOfMovies. We add each of these assets at the current position of our composition using their full range ([asset duration]). For every asset we advance our composition head (current) by the length (in CMTime) of the asset. At this point our composition is made of the full set of tracks added in sequence. We can now play them.
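As a small illustration of this rational-time arithmetic, a few typical CMTime manipulations (all of these functions belong to Core Media) could be sketched as follows:

```objc
#import <CoreMedia/CoreMedia.h>

CMTime oneFrame  = CMTimeMake(1, 25);              // 1/25th of a second
CMTime tenFrames = CMTimeMultiply(oneFrame, 10);   // 10/25ths of a second
CMTime oneSecond = CMTimeMakeWithSeconds(1.0, 25); // 25/25, i.e. 1 s

// addition and comparison stay exact because everything is rational
CMTime sum = CMTimeAdd(tenFrames, oneSecond);      // 35/25 = 1.4 s
if (CMTimeCompare(sum, oneSecond) > 0) {
    NSLog(@"sum (%f s) is past the one second mark", CMTimeGetSeconds(sum));
}
```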

Playing an asset

The AV Foundation framework doesn't offer any built-in full player as we are used to with MPMoviePlayerViewController. The engine that manages the playing state of an asset is provided by the AVPlayer class. This class takes care of all aspects related to playing an asset and essentially it is the only class in AV Foundation that interacts with the application view controllers to keep the application logic in sync with the playing status: this is relevant for the kind of application we are considering, as the playback state may change during the movie based on specific user interactions at specific moments in the movie. However, there is no direct relation between AVAsset and AVPlayer: their connection is mediated by another class called AVPlayerItem. This class organization has the sole purpose of separating the asset, considered as a static entity, from the player, which is purely dynamic, by providing an intermediate object, the AVPlayerItem, which represents a specific presentation state for an asset. This means that to a given and unique asset we can associate multiple player items, all representing different states of the same asset and played by different players. So the flow is: from a given asset, create a player item, then assign it to the final player.

AVPlayerItem *compositionPlayerItem = [AVPlayerItem playerItemWithAsset:composition];
AVPlayer *compositionPlayer = [AVPlayer playerWithPlayerItem:compositionPlayerItem];
In order for it to be rendered on screen we have to provide a view capable of rendering the current playing status. We already said that iOS doesn't offer an off-the-shelf view for this purpose, but what it offers is a special Core Animation layer called AVPlayerLayer. You can insert this layer in your player view's layer hierarchy or, as in the example below, use it as the base layer for that view. So the suggested approach in this case is to create a custom MovieViewer and set AVPlayerLayer as its base layer class:

// MovieViewer.h

#import <UIKit/UIKit.h>
#import <AVFoundation/AVFoundation.h>
@interface MovieViewer : UIView {
}
@property (nonatomic, retain) AVPlayer *player;
@end

// MovieViewer.m

@implementation MovieViewer
+ (Class)layerClass {
    return [AVPlayerLayer class];
}
- (AVPlayer*)player {
    return [(AVPlayerLayer *)[self layer] player];
}
- (void)setPlayer:(AVPlayer *)player {
    [(AVPlayerLayer *)[self layer] setPlayer:player];
}
@end

// Instantiating MovieViewer in the scene view controller
// We suppose "viewer" has been loaded from a nib file
// MovieViewer *viewer
[viewer setPlayer:compositionPlayer];

At this point we can play the movie, which is quite simple:

[[viewer player] play];

Observing playback status

It is relevant for our application to monitor the status of the playback and to observe some particular timed events occurring during the playback.
As far as status monitoring is concerned, you will follow the standard KVO-based approach by observing changes in the status property of the player:

// inside the SceneViewController.m class we'll register to player status changes
[viewer.player addObserver:self forKeyPath:@"status" options:NSKeyValueObservingOptionNew context:NULL];

// and then we implement the observation callback
-(void)observeValueForKeyPath:(NSString *)keyPath ofObject:(id)object change:(NSDictionary *)change context:(void *)context {
    if(object==viewer.player) {
        AVPlayer *player = (AVPlayer *)object;
        if(player.status==AVPlayerStatusFailed) {
	  // manage failure 
        } else if(player.status==AVPlayerStatusReadyToPlay) {
	  // player ready: manage success state (e.g. by playing the movie)
        } else if(player.status==AVPlayerStatusUnknown) {
	  // the player is still not ready: manage this waiting status
        }
    }
}
Unlike the KVO-observable properties, timed-event observation is not based on KVO: the reason for this is that the playhead moves continuously and playback is usually done on a dedicated thread. So the system prefers to send its notifications through a dedicated channel, which in this case is a block-based callback that we can register to track such events. We have two ways to observe timed events:

  • registering for periodic intervals notifications
  • registering when particular times are traversed

In both methods the user can specify a serial queue on which the callbacks will be dispatched (it defaults to the main queue) and, of course, the callback block. It is relevant to note the serial behaviour of the queue: all events will be queued and executed one by one; for frequent events you must ensure that each block executes fast enough to allow the queue to process the next blocks, and this is especially true if you are executing the block on the main thread, to avoid the application becoming unresponsive. Don't forget to schedule the block to run on the main thread if you update the UI.
Registration for periodic intervals is done in this way, where we ask for a 1-second callback whose main purpose is to refresh the UI (typically updating a progress bar and the current playback time):

// somewhere inside SceneController.m
id periodicObserver = [viewer.player addPeriodicTimeObserverForInterval:CMTimeMakeWithSeconds(1.0, 1) queue:NULL usingBlock:^(CMTime time){
	[viewer updateUI];
}];
[periodicObserver retain];

// and in the clean up method
-(void)cleanUp {
  [viewer.player removeTimeObserver:periodicObserver];
  [periodicObserver release];
}

// inside MovieViewer.m
-(void)updateUI {
   // do other stuff here
   // ...
   // we calculate the playback progress ratio by dividing the current playhead position by the total movie duration
   float progress = CMTimeGetSeconds(player.currentTime)/CMTimeGetSeconds(player.currentItem.duration);
   // then we update the movie viewer progress bar  
   [progressBar setProgress:progress];
}

Registration for timed events is done using a similar method which takes as argument a list of NSValue representations of CMTime (AV Foundation provides an NSValue category that adds CMTime support to NSValue):

// somewhere inside SceneController.m
id boundaryObserver = [viewer.player addBoundaryTimeObserverForTimes:timedEvents queue:NULL usingBlock:^{
	[viewer processTimedEvent];
}];
[boundaryObserver retain];

// inside MovieViewer.m
-(void)processTimedEvent {
   // do something in the UI
}

In both cases we need to unregister and deallocate somewhere in our scene controller the two observer opaque objects; we may suppose the existence of a cleanup method that will be assigned this task:

-(void)cleanUp {
  [viewer.player removeTimeObserver:periodicObserver];
  [periodicObserver release];
  [viewer.player removeTimeObserver:boundaryObserver];
  [boundaryObserver release];
}

While this code shows the general way to handle an event, in our application it is more appropriate to assign each event a specific action, that is, we need to customize each handling block. Looking at the picture below, you can see that at specific timed intervals inside each of our clips we assigned a specific event.

The figure is quite complex and not all relationships have been highlighted. Essentially what you can see is the "winning" sequence made of all green blocks: they have been placed consecutively in order to avoid the playhead jumping to different segments when the player takes the right decisions, so playback will continue smoothly and without interruption. With the exception of the prologue track, which simply introduces the story and requires no user interaction at this stage, and its corresponding conclusion, an epilogue where the user is invited to go to the next scene, all other tracks have been marked with a few timed events, identified by the dashed red vertical lines. Essentially we have identified 4 kinds of events:

  • segment (clip) starting point: this will be used as a destination point for the playhead in case of jump;
  • show controls: all user controls will be displayed on screen, user interaction is expected;
  • hide controls: all user controls are hidden, and no more user interaction is allowed;
  • decision point, usually coincident with the hide controls event: the controller must decide which movie segment must be played based on the user decision.
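The post does not show the event model itself; as a purely hypothetical sketch of what the TimedEvent class used below might contain (the type names and fields here are my assumptions, not part of AV Foundation or the original code), one could define:

```objc
#import <Foundation/Foundation.h>
#import <CoreMedia/CoreMedia.h>

// Hypothetical timed-event model attached to each clip.
typedef enum {
    TimedEventSegmentStart,   // destination point for playhead jumps
    TimedEventShowControls,   // display the user controls
    TimedEventHideControls,   // hide the user controls
    TimedEventDecisionPoint   // choose the next segment to play
} TimedEventType;

@interface TimedEvent : NSObject {
}
@property (nonatomic, assign) TimedEventType type;
@property (nonatomic, assign) CMTime time; // clip-relative; re-timed when the composition is built
@end
```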

Note that this approach is quite flexible: in theory you can define any kind of event, depending on the imagination of the game designers. From the point of view of the code, we in fact subclassed AVURLAsset by adding an array of timed-event definitions. At composition creation time, these events are re-timed according to the new time base (e.g. if an event fires at 0:35 of a clip, but the starting point of the clip is at 1:45 of the entire sequence, then the event must be re-timed to 1:45 + 0:35 = 2:20). At this point, with the full list of events, we can rewrite our boundary registration:

// events is the array of all re-timed events in the complete composition
__block __typeof__(self) _self = self; // avoids a retain cycle on self when used inside the block
[events enumerateObjectsUsingBlock:^(id obj, NSUInteger idx, BOOL *stop) {
    TimedEvent *ev = (TimedEvent *)obj;
    [viewer.player addBoundaryTimeObserverForTimes:[NSArray arrayWithObject:[NSValue valueWithCMTime:ev.time]]
                                             queue:dispatch_get_main_queue()
                                        usingBlock:^{
                                            // send the event to the viewer and the scene controller
                                            [viewer performTimedEvent:ev];
                                            [_self performTimedEvent:ev];
                                        }];
}];

As you can see the code is quite simple: for each timed event we register a single boundary which simply calls two methods, one for the movie viewer and one for the scene controller; in both cases we send the specific event so the receiver will know exactly what to do. The viewer will normally take care of UI interaction (it will overlay a few controls on top of the player layer, so according to the events these controls will be shown or hidden; besides the viewer knows which control has been selected by the user) while the scene controller will manage the game logic, especially in the case of the decision events. When the controller finds a decision event, it must move the playhead to the right position in the composition:

CMTime goToTime = ...; // determine the starting time of the next segment
[viewer hide];
[viewer.player seekToTime:goToTime toleranceBefore:kCMTimeZero toleranceAfter:kCMTimePositiveInfinity completionHandler:^(BOOL finished) {
    if (finished) {
        dispatch_async(dispatch_get_main_queue(), ^{
            [viewer show];
        });
    }
}];

What happens in the code above is that when we need to move the playhead to a specific time, we first determine this time and then ask the AVPlayer instance to seek to it, allowing the head to land at that position or after, with some tolerance (kCMTimePositiveInfinity), but never before (kCMTimeZero in the toleranceBefore: parameter; we need this because the composition is made of consecutive clips, and moving the playhead before the starting time of our clip could show a small portion of the previous clip). Note that this operation is not immediate, and even if quite fast it can take about one second. What happens during this transition is that the player layer shows a still frame somewhere in the destination time region, then starts decoding the full clip and resumes playback from another frame, usually different from the still one. The final effect is not really good, so after some experimentation I decided to hide the player layer immediately before seeking and to show it again as soon as the player informs me (through the completionHandler callback block) that the movie is ready to be played again.

Conclusions and references

I hope this long post will push other developers to start working on interactive movie apps that leverage the advanced video capabilities of iOS beyond plain video editing. The AV Foundation framework offers us very powerful tools which are not difficult to use. In this post I didn't explore some more advanced classes, such as AVVideoComposition and AVSynchronizedLayer: the former is used to create transitions, the latter to synchronize Core Animation effects with the internal media timing.

Great references on the subject can be found in the iOS Developer Library or WWDC videos and sample code:

  • For a general overview: AVFoundation Programming Guide in the iOS Developer Library
  • For the framework classes documentation: AVFoundation Framework Reference in the iOS Developer Library
  • Video: Session 405 - Discovering AV Foundation from WWDC 2010, available in iTunesU to registered developers
  • Video: Session 407 - Editing Media with AV Foundation from WWDC 2010, available in iTunesU to registered developers
  • Video: Session 405 - Exploring AV Foundation from WWDC 2011, available in iTunesU to registered developers
  • Video: Session 415 - Working with Media in AV Foundation from WWDC 2011, available in iTunesU to registered developers
  • Sample code: AVPlayDemo from WWDC 2010 sample code repository
  • Sample code: AVEditDemo from WWDC 2010 sample code repository
