Instrumental Anime Project

Discussion & organization of Multi-Editor Projects
Locked
User avatar
downwithpants
BIG PICTURE person
Joined: Tue Dec 03, 2002 1:28 am
Status: out of service
Location: storrs, ct
Org Profile

Post by downwithpants » Thu Nov 25, 2004 4:59 pm

That's clever for an automated lip-synch program. Although using levels of intensity wouldn't catch all consonants, it'd probably work well for voiced and unvoiced stops (b, g, d, p, k, t) and unvoiced fricatives and affricates (s, f, ch).

While intensity is a rough cue for discriminating between vowels and consonants, unfortunately we aren't yet at the stage of being able to discriminate between individual consonants by their spectra (which would help with more intricate lip-synching, like synching tongue, teeth, and lip movements), since the spectrum of any phoneme in spoken context depends on the preceding and following phonemes.
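For the curious, the kind of intensity cue I mean boils down to something like this (a rough Python sketch assuming a 16-bit mono WAV; the filename and thresholds are invented and would need tuning against real narration):
[code]
# Classify each video-frame-sized chunk of audio purely by short-time
# intensity (RMS). Thresholds are made up for illustration.
import wave, array, math

def frame_rms(path, frame_ms=33):        # ~one video frame at 30 fps
    w = wave.open(path, "rb")
    n = int(w.getframerate() * frame_ms / 1000)
    assert w.getsampwidth() == 2 and w.getnchannels() == 1
    while True:
        chunk = w.readframes(n)
        if not chunk:
            break
        samples = array.array("h", chunk)
        yield math.sqrt(sum(s * s for s in samples) / len(samples))
    w.close()

for i, rms in enumerate(frame_rms("narration.wav")):  # hypothetical file
    if rms < 300:
        label = "silence"
    elif rms < 2000:
        label = "consonant-ish"   # stops, fricatives
    else:
        label = "vowel-ish"
    print(i, round(rms), label)
[/code]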
rose4emily wrote:Finally, the phrase "the end is near" is more probable with relation to this project than it is with respect to the existence of humanity.
woo hoo!

Anyway, to everyone celebrating Thanksgiving, enjoy your turkey or vegetarian alternative.
maskandlayer()|My Guide to WMM 2.x
a-m-v.org Last.fm | Animemusicvideos.org Frappr (http://www.frappr.com/animemusicvideosdotorg) | Editors and fans against the misattribution of AMVs (http://tinyurl.com/2lryta)

User avatar
Otohiko
Joined: Mon May 05, 2003 8:32 pm
Org Profile

Post by Otohiko » Thu Nov 25, 2004 6:00 pm

Janzki wrote:
Otohiko wrote:Whoa. You made an auto-lip-sync program?
:shock:

The possibilities!
Heh, well, I can see a lot of limits to it, but on the other hand...

Imagine if someone made a program that used voice-recognition software to match the audio up to the proper (or closest-matching) mouth positions for an anime character? I mean, eh...

The best I've done so far is pay attention to what I learned in Phonetics class when lip-synching - not that mine ended up looking too bloody great anyway :roll:

***

But yes, the nearer to this End, the better. Happy Thanksgiving, Americans (and anyone else who may be celebrating. Heh, we Canadians already had ours)
The Birds are using humanity in order to throw something terrifying at this green pig. And then what happens to us all later, that’s simply not important to them…

User avatar
Bakadeshi [AuN Studios]
Joined: Wed Mar 24, 2004 7:59 pm
Location: Georgia / S. FL WIP: ROS2, VG3, AR2
Contact:
Org Profile

Post by Bakadeshi [AuN Studios] » Fri Nov 26, 2004 11:07 pm

Heh, I've held off on releasing my segment until the project releases first, and have since almost forgotten I even had the dang thing, lol. For real, it's about time this project is nearing completion.

Strange that it stopped sending me email notifications of updates to this thread; I didn't even know there was activity till I noticed it in the multi-editor project forums.

On a side note, I noticed Jasper-Isis gave up on predicting when this project would release.... :lol:
Recommended underrated video (not mine): Jasper-Isis - Ever Searching

User avatar
rose4emily
Joined: Fri Jan 23, 2004 1:36 am
Location: Rochester, NY
Contact:
Org Profile

Post by rose4emily » Sun Nov 28, 2004 12:29 am

I got stuck on a train car with no electrical sockets on the way back, so I'm a bit behind schedule. Still, I have narrative video for four of the six fullscreen sections, and can have the other two done within a half hour of receiving the remaining two narratives from Song.

I've also just about finished the intro. It needs a little more work, but the bones are all there. I did decide to go with the Yo-Yo Ma track from his "Obrigado Brazil" album. I'll have to dig out the liner notes to get the name of the song for the credits, but I can say for the moment that it's a brisk, high-spirited track with just a little international flavor (though I think this particular one seems as much French as it does Brazilian).

I've recorded new audio for my narrative sections - it's much clearer than the old audio, so that's a good thing. I think I'll have to adjust the threshold levels in my lip-synching program to get them to look right when applied to my voice (the program doesn't correct any of the levels for the real or perceptual relationships between amplitude, energy, and frequency spectrum). When I do, however, the lip-synch should probably come out a little cleaner than the fullscreen set did, as my speech is a fair bit slower than Song's (the program does have noticeable issues with rapidly repeated consonants, making the mouth seem to either waver at some unnatural speed or else not move at all if the consonants are all either voiced or unvoiced and read off very quickly).

I've been correcting for this deficiency with rapidly spoken sets of consonants in the widescreen set by editing the resulting shell script by hand in places where I think something is noticeably "off". This is also how I add the blinking, which is partially random and partially timed to where I would likely blink if giving the same speech. Having a little eye movement (actually my father's idea) does go a long way toward making the scene seem more natural, and dividing the viewer's attention three ways between eyes, mouth, and slides gives them plenty to look at and should keep them from overanalyzing any single one of the three.

Oddly enough, neither the mouth nor the blinking eyes are from the same video clip as Miyazawa's body and head. She was taken from a more or less still image that occupied a few frames in the intro, and the mouth came off of her mother in some scene where I noticed a similarly posed head (Miyazawa usually seems to have her head at an angle when she speaks, or else be in some SD form). The blinking eyes were from Miyazawa, in a scene later in the series where she did have her head at an angle (amazing how much the eyes are ignored in any dialogue that isn't a super close-up), but I actually used one and flipped it for the other eye to maintain symmetry for the forward-looking Miyazawa doing the narrations. A little rescaling of each piece, some repainting to match skin tones, and some creative cut-and-paste work, and our Frankenstein-creation narrator was transformed from a static head into a 2D puppet of sorts that could be animated by choosing which of six cels (three with eyes open, three with eyes shut) to use for any given frame.
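Once the audio has been reduced to per-frame levels, the cel choice is a tiny decision. Here's the shape of it in Python (a simplified sketch - the filenames and thresholds are invented, and the real program writes out a shell script rather than picking images directly):
[code]
import random

# Six hypothetical cels: 3 mouth positions x eyes open/shut
CELS = {(m, e): "narrator_m%d_e%d.png" % (m, e)
        for m in range(3) for e in range(2)}

def blink_schedule(total_frames, avg_gap=90):
    """Mostly-random blinks, each lasting two frames."""
    left = 0
    for _ in range(total_frames):
        if left == 0 and random.random() < 1.0 / avg_gap:
            left = 2
        yield left > 0
        left = max(0, left - 1)

def pick_cel(rms, eyes_shut):
    # quiet -> closed mouth, mid -> half open, loud -> open
    mouth = 0 if rms < 300 else (1 if rms < 2000 else 2)
    return CELS[(mouth, 1 if eyes_shut else 0)]
[/code]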

---

As to making a "better" lip-synching program, I would optimally support a greater number of volume threshold levels, a more sophisticated measurement of volume than averaging the absolute values of all the samples within a given frame, spectral analysis to at least correlate vowel sounds to the shape of the mouth, and transient detection which would, at the very least, pick up on the presence of all those relatively loud but ephemeral unvoiced "t" and "p" sounds.
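To illustrate the transient-detection part, a crude sketch (the ratio, window, and floor constants are invented): flag any frame whose energy spikes well above its neighbors.
[code]
def find_transients(energies, ratio=3.0, window=4, floor=500):
    """Crude stand-in for catching short 't'/'p' bursts: flag frames
    much louder than the average of their local neighborhood."""
    flags = []
    for i, e in enumerate(energies):
        lo, hi = max(0, i - window), min(len(energies), i + window + 1)
        neighbors = energies[lo:i] + energies[i + 1:hi]
        baseline = sum(neighbors) / max(1, len(neighbors))
        flags.append(e > ratio * baseline and e > floor)
    return flags
[/code]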

This would, of course, require a lot of different mouth images to be prepared ahead of time.

Then there are inter-consonant relationships and compound vowel sounds and so forth - but most of the more complex vocal nuances would require something much less abstract than an anime character's mouth for expression anyhow.

I hear that South Park uses some sort of automated lip-synching for their characters - though I believe their program just reads letters off of the script and picks a corresponding mouth for each one, thus creating their jittery signature look while maintaining at least some visual correlation between the sound of a word and what it looks like as it is being said.
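I have no idea what their actual pipeline looks like, but the letters-to-mouths idea would be about this simple (a toy sketch with an invented mouth set):
[code]
VOWEL_MOUTHS = {"a": "wide", "e": "wide", "i": "narrow",
                "o": "round", "u": "round"}

def mouths_for(script_line):
    # one mouth cel per letter; vowels get a shaped mouth,
    # everything else collapses to a "closed" cel
    return [VOWEL_MOUTHS.get(c, "closed")
            for c in script_line.lower() if c.isalpha()]

print(mouths_for("Oh my God"))  # ['round', 'closed', ...]
[/code]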

There was also that robot thing from the early 90's that the AI community tried to present as a model for simulated human expression. Reminded me more of the lips from the opening of the Rocky Horror Picture Show, but it could at least "speak" and "smile" in a recognizable manner. Maybe the technology's matured a bit since then, though.
may seeds of dreams fall from my hands -
and by yours be pressed into the ground.

User avatar
Songbird21
Joined: Tue Jun 18, 2002 5:00 pm
Status: Single
Location: CT, USA
Org Profile

Post by Songbird21 » Sun Nov 28, 2004 2:18 am

I'll get my last 2 narratives in as soon as I can. I have a nasty little cold at the moment, so my voice sounds like crap. Gomen nasai (sorry!).
Best editing Connecticon 2013: Bravery

User avatar
rose4emily
Joined: Fri Jan 23, 2004 1:36 am
Location: Rochester, NY
Contact:
Org Profile

Post by rose4emily » Sun Nov 28, 2004 8:09 pm

Thanks, Song. No need to apologize - it's my fault you have to re-record your half of the narratives in the first place.

I am now putting together the video for the widescreen narratives. It's taking me a little longer than I thought, mainly because I forgot that I hadn't yet composited all of the individual slides.

See, each narrative requires one image for every possible combination of eye position, mouth position, and on-screen slide. For a narrative with six slides (most are somewhere around this number), I need 2x3x6, or 36, images. For a narrative with eight slides, that's 48 images. The good news is that I've been doing the compositing from the command line, rather than in some GUI image-editing application - so each one doesn't take very long. The bad news is that I still have to create about 200 composites before I can finish the widescreen narratives. Currently, I have about a third of them out of the way, and the rest are coming together fast. This just put a little unexpected bottleneck in my working process.
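For what it's worth, the combination loop itself is trivial. Here's the same idea sketched in Python with PIL (filenames and paste offsets are invented - I'm really doing it with command-line tools, but the structure is identical):
[code]
from itertools import product
from PIL import Image

EYES   = ["open", "shut"]                       # 2 eye positions
MOUTHS = ["closed", "half", "open"]             # 3 mouth positions
SLIDES = ["slide%d.png" % n for n in range(6)]  # 6 slides -> 36 composites

body = Image.open("narrator_body.png").convert("RGBA")
for eyes, mouth, slide in product(EYES, MOUTHS, SLIDES):
    frame = body.copy()
    # layer the eye cel, mouth cel, and slide onto the body at fixed offsets
    frame.alpha_composite(Image.open("eyes_%s.png" % eyes).convert("RGBA"), (140, 60))
    frame.alpha_composite(Image.open("mouth_%s.png" % mouth).convert("RGBA"), (150, 110))
    frame.alpha_composite(Image.open(slide).convert("RGBA"), (320, 40))
    frame.save("comp_%s_%s_%s" % (eyes, mouth, slide))
[/code]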

I have put together the encoded fade-in and fade-out for the widescreen narratives, and encoded versions of all of the widescreen titles fading in and out over the course of about 5 seconds. All of this is also done for the fullscreen section; I've just had to repeat the process with different source images for the widescreen set.

I've also managed to create an eyes-shut version of my widescreen narrator (I switched out Kawashima for Tsubasa's father [still from Kare Kano] after having some trouble finding a shot of Kawashima that would let me present him in about the same proportions and pose as Yukino and that didn't have strange discoloration at the top of his head [much of Kare Kano seems to have a darkened region near the top of the screen that looks somewhat unnatural in other settings]). I actually had to paint the shut eyes, and they lack the detail of the ones I found for Yukino. I'm not too worried about this, however, as Yukino had much larger eyes in relation to her head, and each blink takes place over the course of only two frames.

---

Update on my plans for the end credits:

I've found that just "sliding" the image tiles I've collected across the screen to hide and reveal text looks pretty ugly, and poses something of a conundrum as to what I should do when the tiles intersect in the middle of the screen (I was bringing one in from each side).

So I've sketched out a new adaptation of this idea in which the tiles are rotating planes, each with a vertical axis perpendicular to the viewer's line of sight (so the planes rotate in a horizontal plane as seen by the viewer). I'm setting the spin to such a rate and phase that each tile is revealed once at full width halfway to the center of the screen, the two meet when they are both edgewise to the viewer, and each can be seen once again at full width halfway back toward the edge of the screen. This resolves the overlap issue and should be much more interesting to look at. It is also simple enough that it can be done entirely in the 2D domain through foreshortening tricks.
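In case the foreshortening trick isn't obvious: the apparent width of a flat tile spinning about a vertical axis is just its full width times |cos| of the rotation angle. A sketch (sizes and frame counts invented):
[code]
import math

FULL_W   = 160   # tile width in pixels (invented)
DURATION = 120   # frames for one half-turn (invented)

def tile_width(frame):
    # full width at angle 0 and pi, edge-on (width -> 0) at pi/2;
    # rate and phase get chosen so full width lands halfway to center
    angle = math.pi * frame / DURATION
    return max(1, round(FULL_W * abs(math.cos(angle))))

# squash the tile horizontally to tile_width(frame) each frame while it
# travels across the screen, and it reads as a rotating plane
for f in range(0, DURATION + 1, 15):
    print(f, tile_width(f))
[/code]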

That's the part I know I'm doing. Here's the part that I'd like to do, but don't yet know how well it will work:

As each tile spins past the current text (this is a fairly slow spin; think of something adrift in space), the text will dissolve and disperse like dust. I know how to do this, but I'm a little concerned about the complexity of the required code getting in the way of my actually writing it by the time I get the remaining two narratives from Song.

As the tiles meet, they will "tear open" a vertical-ish particle column, from which the new text will coalesce. Once again, I know how to do this, but don't know exactly how long it would take.

I'll write up the "particle-free" version first, as that one won't take long at all, so I'll have something done even if the "really cool" part fails to come together.

---

The intro text is just a list of editors, presented in alphabetical order (sorted "Last First", shown "First Last"), with the word "Instrumentality" in somewhat larger letters at the end. "Instrumentality" then fades into "Animasia" - but I'm going to try that "particle-dissolve" idea on that, too, since I think it would look much better. Your names are revealed and collapsed by a pair of glowing blue lines, and the only two image tiles used are one of each narrator (the animated versions, not our real pictures), just before the presentation of the word "Instrumentality".

---

The reason I'm writing all of this out in text, instead of just pointing you to something to download, is that my home connection is currently uploading at about dial-up speed on both the standard FTP and HTTP ports, and I'm afraid my account on Grace (RIT's server) would get revoked if it became the cause of a huge sudden bandwidth spike from sharing video footage with more people than probably visit the typical student page over the course of a month (I'm just talking about the nine of you, not the audience I hope will see this after it's done).

I have a half-finished network file-transfer app that I started last year but grew bored with. I might revive it this weekend if my internet connection doesn't open up a little - it uses non-standard ports to transfer single files to other people running the same app (think of it as a really limited cross between an FTP client and a P2P file-sharing application, with no search capability whatsoever and support for only a single directory of files). That app, as generally useless as it is, might be what I need to get this file out to you before uploading it for Local download.

I might also end up having one or two of you download a copy from me, and then share it with the others. I'd like to give you a chance to see this before it's fully locked as being in its final form, but you might have to jump through a couple of hoops to get it.

---

One last time (I hope): If there's anything you want to see in this project that you haven't yet given me, there isn't much time left to submit it. The only two files I'm going to wait for, at this point, are Song's remaining narratives. Everything else gets in if I get it on time.
may seeds of dreams fall from my hands -
and by yours be pressed into the ground.

User avatar
Otohiko
Joined: Mon May 05, 2003 8:32 pm
Org Profile

Post by Otohiko » Sun Nov 28, 2004 8:38 pm

Well, it certainly sounds like you're on track...

As far as the spinning tiles and such go, I'm certainly no expert on those, especially in the way that you seem to be making them. In any case, I can see that the in-between work (that is, what's between the vids) is gonna end up as something very interesting in itself.

All the more reason to wait for the project....

***

Meanwhile, about things to pass on to you to change/add - are you past the stage of implementing the pictures (I think they're supposed to be in the ending)?

If not, the only thing I might want to send is a slightly updated picture of me, since the one you're currently using isn't quite up to date.

Given how unimportant this is to the video, I guess it's not too much of a concern either way...
The Birds are using humanity in order to throw something terrifying at this green pig. And then what happens to us all later, that’s simply not important to them…

User avatar
rose4emily
Joined: Fri Jan 23, 2004 1:36 am
Location: Rochester, NY
Contact:
Org Profile

Post by rose4emily » Tue Nov 30, 2004 1:42 pm

The way I'm doing the end credits, I can switch out images right up to the point where I encode it for the compilation. So, yes, new images can still be submitted for the end credits.

The images that are locked in at this point are the ones for the narratives, as I've already done nearly all of the work for compositing, animating, and encoding those. I still have a couple to go for the animating and encoding portions of the process, but all of the "slides" have already been composited into the set.

I've made slides from screenshots for the couple of videos that I hadn't received any slides for. For these, I've done my best to pick shots that looked good and captured the spirit of the video/narrative.

---

The "text dissolve" effect for the end credits is solidifying in my mind. I think I can get a pretty convincing effect out of applying successive randomizations to the location of values in each vertical column of the image as the tiles pass them. It's not the most computationally efficient method possible, I imagine, but it should look convincing enough once rendered (and who in their right mind would worry about optimization for a "use-once" program, anyhow). I'm actually thinking to randomize the placement of the pixel values in two passes, and average them for each row for each frame. This will not only break the text into particles, but also cause a diffusion of light toward a cloud-like point of entrophy. I also think the randomization function should be weighted, to model a slow dispersion rather than an instantanious one.

I could try the dispersion in two dimensions, but I think that would be much more complex - and I don't imagine it would really look all that much better than the version I'm working on, which performs all of its operations on vertical stripes of pixels.

The resulting program will be very simple, but it has some unusual (at least for me) algorithms in it (and I have returned to my tight academic schedule), so I expect it won't be done until the end of the upcoming weekend. In a way, this seems like an awfully long delay for a single text effect - but I think it'll be worth it when you consider the unique touch it will add to the look of the intro and credits.

---

Yesterday, I made what is probably the quickest (and, ironically, most watchable) "AMV" I have ever produced: a synchronized fight scene from ep. 43 of Card Captor Sakura set to the audio from a similar scene in "Dance Like You Want To Win", which happen to fit so well that it looks like the scene was designed for the song. No real effort on my part - I just muxed the two together - but I thought it looked cool enough to put up for Local download.
may seeds of dreams fall from my hands -
and by yours be pressed into the ground.

User avatar
Otohiko
Joined: Mon May 05, 2003 8:32 pm
Org Profile

Post by Otohiko » Tue Nov 30, 2004 3:05 pm

Perhaps you should post about that in the Unintentional Sync thread going on there now? :wink:
The Birds are using humanity in order to throw something terrifying at this green pig. And then what happens to us all later, that’s simply not important to them…

User avatar
kingmob_867
Joined: Thu Nov 13, 2003 12:32 pm
Org Profile

Post by kingmob_867 » Tue Nov 30, 2004 5:30 pm

so.... is it done yet? :?:
"Suicide carried off many. Drink, and the devil, took care of the rest."
