I Made a Trainer for IndexTTS v2 For New Languages - Work in Progress
By Jarods Journey
Summary
## Key takeaways

- **First IndexTTS v2 Training Loop for New Languages**: The speaker claims to be the first to implement a training loop for IndexTTS v2, enabling fine-tuning for new languages like Japanese. This work is still in progress. [00:48]
- **AI-Assisted Development with Codex CLI**: The speaker utilized OpenAI's Codex CLI to implement the training loop, collaborating with the AI agent by providing context and iterating through prompts to develop the necessary scripts for data processing and model training. [01:01], [06:21]
- **Challenges in Data Pre-processing and Training**: The process involved significant challenges, including accidental data deletion, encountering segmentation faults when running scripts in WSL2 requiring workarounds in Windows, and lengthy data pre-processing that took a day for 450,000 samples. [00:30], [03:31]
- **Three Key Stages of Training Implementation**: The implementation of the training loop is broken down into three main stages: building a new tokenizer, pre-processing the data, and establishing the training loop itself, with continuous iteration to fix any issues. [11:08]
- **Focus on Text and Mel Loss for Improvement**: Key metrics for model improvement are identified as text loss, relating to text prediction accuracy, and mel loss, relating to sound prediction accuracy. Both are observed to be decreasing, indicating model progress. [11:31]
- **Inspiration from Previous Repositories**: The development was inspired by existing repositories for IndexTTS v1.5, particularly for data pre-processing, adapting techniques for the newer v2 architecture which includes emotion and duration control. [12:38]
Topics Covered
- Can TTS Models Replicate Emotion Across Languages?
- Real-World Hurdles: Training TTS Models from Scratch.
- AI Agents: Your Co-Pilot for Complex ML Training?
- Mastering AI Agents: The Iterative Loop of ML Development.
- New Languages: Why 'Fine-Tuning' is Really Training From Scratch.
Full Transcript
What's going on today, YouTube? Today we're going to be talking about IndexTTS 2. In my previous video, I showed you guys what IndexTTS 2 is, how it can be used for dubbing, and how you can get it installed. So I took a really deep dive over the last week, up until now, to actually implement a fine-tuning training loop for new languages, and I think I've got it. I've been hard at work on this for a week and had several difficulties with it, so I'm going to be talking about that in this video. I actually ended up accidentally deleting all my progress yesterday, because I unregistered WSL2 by accident. Well, I reimplemented it, and I have some results to talk about. So, very excited to show it off.
I don't think there's a training loop out there yet that's been implemented, so as far as I know I'm the first one that's done a v2 implementation, which is cool. That's actually a first for me; I've never implemented a training loop on my own. And to say "on my own" would be kind of misleading, because I've used Codex CLI to do this. But anyways, it's kind of cool that we're able to use AI to implement training loops for these open source text-to-speech models. A little bit of blabbering, but IndexTTS 2, if you don't know, is a model released by Bilibili, and it's a fantastic model that's able to replicate emotion in its audio.
So, I can just give an example. Um, so
we have this sample. This is from the video game Expedition 33, and the guy here is fairly angry. So,
there's some swear words in here. So,
I'm going to go ahead and play it. But
uh yeah, this is it.
>> Oh, [ __ ] the mission.
[ __ ] the mission, Lune.
>> So, as you can tell, he's pretty mad. Um
and then this is the fine-tuned version in Japanese here.
So there you go. Not bad in my opinion.
It transferred over the emotion, which is what we've seen with IndexTTS 2, and it's in Japanese, which is not supported by the base IndexTTS 2 model. The code and the implementation for it are actually inside of my GitHub repository here, the training v2 work, and it is still a work in progress. Currently I am still training Japanese; as you can see, we've got the training run going on here, and the loss is still going down. We like to see curves where the loss is continually going down. It looks like we're having a spike down there, and the text loss is down here, slowly improving as we continue training,
and so if you guys want to take a look at what was implemented here, I pushed it to GitHub. That also saves my progress, because I don't want to accidentally delete it again. So we can train a new tokenizer for the language and pre-process the data to train on, and then it is indeed possible to train another language from scratch with enough data.
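As a rough illustration of that tokenizer step, here's a minimal sketch of training a new SentencePiece BPE model on a Japanese text corpus. The file names, vocab size, and coverage setting are illustrative assumptions, not the exact settings used in my trainer.

```python
# Minimal sketch: train a new BPE tokenizer for Japanese with SentencePiece.
# Paths and hyperparameters are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="japanese_corpus.txt",   # one utterance transcript per line
    model_prefix="ja_bpe",         # writes ja_bpe.model and ja_bpe.vocab
    vocab_size=12000,              # must match the size of the model's new text embedding table
    model_type="bpe",
    character_coverage=0.9995,     # high coverage so rare kanji are not mapped to <unk>
)

# Quick smoke test of the resulting tokenizer.
sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
print(sp.encode("こんにちは、世界。", out_type=str))
```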
I have been training with the Emilia-YODAS dataset, which is, I believe, …00 hours of Japanese, and for all of the training done here I've actually only trained with half of that, because I literally just finished pre-processing all of the data; for some reason that took about a day to pre-process 450,000 samples. So I will be rerunning this training with the full dataset to see if that helps improve things a little bit. Right now it's still transferring files from Windows over to WSL, because for whatever reason this pre-processing script fails inside of WSL 2 with a segmentation fault. I don't know what's causing it, so I have to run it in Windows and then transfer the output over to WSL 2.
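To give a feel for what the pre-processing stage does per sample, here's a heavily simplified sketch, assuming pre-processing means turning each transcript into token ids and each clip into a log-mel spectrogram saved to .npy. The real IndexTTS 2 pipeline does more than this, and the mel parameters below are placeholders, not the actual settings.

```python
# Simplified sketch of one pre-processing pass: text -> token ids, audio -> log-mel spectrogram.
# Paths, sample rate, and mel parameters are illustrative assumptions.
import numpy as np
import sentencepiece as spm
import torch
import torchaudio

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=100
)

def preprocess_sample(wav_path: str, transcript: str, out_prefix: str) -> None:
    """Tokenize the transcript and extract a log-mel spectrogram, saving both to .npy files."""
    token_ids = np.array(sp.encode(transcript, out_type=int), dtype=np.int64)

    waveform, sr = torchaudio.load(wav_path)
    if sr != 22050:
        waveform = torchaudio.functional.resample(waveform, sr, 22050)
    mel = torch.log(mel_fn(waveform) + 1e-6).squeeze(0).numpy()  # shape (n_mels, frames)

    np.save(f"{out_prefix}_text.npy", token_ids)
    np.save(f"{out_prefix}_mel.npy", mel)
```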
But I mean, it's pretty much just going to be a video of me blabbering about what I've done, so there's nothing else too fantastic that I could show for inference in terms of audio or different samples. I have tested this in Japanese on other voices; let's change this to, say, Vivy and inference on it. I do think the duration handling is a little broken: this Vivy sample here is about 15 seconds, and the sample that I'm trying to output, the target, would be around 4 to 5 seconds, so we might get a bit of an issue with this inference. So the model isn't yet perfect, but for the intents and purposes of redubbing into Japanese, it seems like it's coming together, being able to adapt the emotion and keep roughly the same duration. So, here we go.
We've got it running here. Um and uh
this finished up, so we can take a
listen to it. It's 11 seconds. So, I
think there's going to be some
artifacting in here.
Yeah. So, this one has a bunch of
different artifacts. Um, it doesn't
actually say the sentence correctly, um,
which is here. It kind of said some of
it near the end, but I think we've got a bit of a mismatch between the duration of the input audio and what we want for the output audio. Also, the input audio is
Japanese. I don't know if that affects
it too much here, but um, yeah, I I've
kind of got to, you know, play around
with it a little bit more to see uh,
what happened there. And, um, well, I
guess I'll just go into how I did this
implementation and how I got a training
loop up and running for this. So, I'm no
machine learning expert or researcher or
anything like that. Um, I just dabble a
little bit in training these models. Um,
and so I don't really have all the
technical skills to implement this from
scratch. So what I used was Codex CLI, which is OpenAI's latest agent, in order to implement this. And I guess I can just go over some of the prompts and some of the ways that I did this with ChatGPT / Codex CLI.
So this is kind of how I started off. I prefaced it by saying that I've done this before, but I lost all of the data in a data deletion, which is actually something that happened. And then I said the goal is to figure out how to fine-tune IndexTTS 2 for Japanese: here's what we had done, and here's the steps we followed. So I outlined the steps that I want to go through with the model, to give it all of the context that it needs to help me build a fine-tuning training loop for
this. And so I give it the IndexTTS papers, versions one and two. I give it this IndexTTS LoRA repo, which is some work from another guy; I'll show you guys real quick. And then Amphion's MaskGCT, which I believe is used to process the audio samples.
And so yeah, basically my idea is to collect everything I think the model would need in order to implement the training loop, give it all of that context, and then work through it with the agent step by step: figure out how to process the data; pre-process the data into, you know, numpy arrays or however it needs to be pre-processed; tokenize the text to make sure it can handle the language; see if we can fine-tune the GPT, the predicting model, to output the correct tokens that we need for Japanese; and, oh, figure out how to extend the text embeddings of the original GPT model so that it can take a different tokenizer. That's kind of the high-level overview of how I think about this. So yeah, I start off with that, I collect all the resources that I need to give Codex CLI the context around what I want to do, and then we basically just go back and forth, with it asking questions and
me answering them. So what I like to do with agents is to say something along the lines of: before we start doing any coding, please start with clarifying questions about the project, as I want to make sure we're on the same page. I love to do this with any agent I'm using, because I want to make sure that what it thinks I want actually aligns with what I want, and to do that, I make sure we're on the same page. So, yeah, I answer all the questions here, and then I basically do it again and answer all the questions it asked for clarification. And then the next step that we end up getting to is this:
Okay, so here what I wanted to do was to test the tokenizer. I've had issues with training text models in the past where the tokenizer doesn't actually work. The tokenizer is what turns the text, the words, into numbers that the model can understand, the numbers its prediction weights are tied to. We check it to make sure that there are no unknown characters; that's what these tokens are. We should not be getting any unk tokens, because when we're using the tokenizer, we don't want any unknown characters in the Japanese that we're trying to inference on. So I want to test the tokenizer first.
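The check itself is simple. Here's roughly what it looks like with a SentencePiece tokenizer, assuming a held-out list of Japanese transcripts; the file names are illustrative.

```python
# Sketch of the tokenizer check: encode Japanese transcripts and flag any <unk> tokens.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
unk_id = sp.unk_id()

bad = 0
with open("japanese_val_transcripts.txt", encoding="utf-8") as f:
    for line in f:
        ids = sp.encode(line.strip(), out_type=int)
        if unk_id in ids:
            bad += 1
            print("contains <unk>:", line.strip())

print(f"{bad} transcripts produced unknown tokens")  # we want this to be 0
```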
Got that finished up. So these next few messages are verifying that the tokenizer works; here I've got something like "run verification please" again. And then the next stage would be to pre-process the data, and I worked to prompt it to build the pre-processing loop. So, here we go.
have this um
and then yeah, just some back and forth
between uh myself and the agent here.
And then, you know, I was running into
some crashes. And so, I ended up moving
it to Windows because um of some stuff
that I found online. And then, yeah, so
that's the pre-processing. And then it
also built the um you know, the training
loop. And so here I'm reiterating to,
you know, change how it saves and
everything like that. And then yeah,
it's broken down for the most part into
three stages. Um, building a new
tokenizer, pre-processing the data, and
then the training loop. And so if the
training loop had any issues with it,
then I would reiterate with the agent to
try to fix it to see if we could get the
um model training for what we needed to.
So there were two things that I identified that we needed, or that would be good to focus on, for loss: text loss and mel loss. Text loss, I believe, is how accurately it's predicting the text, and mel loss is how accurately it's predicting the sound, though I would have to double check and verify. But as both of these go down, the model seems to be getting better and better at predicting Japanese.
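To make that concrete, here's a minimal sketch of how two cross-entropy terms like these are typically combined in an autoregressive TTS trainer. The tensor names, shapes, and the unweighted sum are assumptions for illustration, not necessarily how the IndexTTS 2 GPT computes its losses.

```python
# Sketch: combining a text-token loss and a mel/acoustic-token loss during training.
# Shapes, names, and the simple unweighted sum are illustrative assumptions.
import torch
import torch.nn.functional as F

def combined_loss(
    text_logits: torch.Tensor,   # (batch, text_len, text_vocab)
    text_targets: torch.Tensor,  # (batch, text_len)
    mel_logits: torch.Tensor,    # (batch, mel_len, mel_vocab)
    mel_targets: torch.Tensor,   # (batch, mel_len)
) -> dict:
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1)
    )
    mel_loss = F.cross_entropy(
        mel_logits.reshape(-1, mel_logits.size(-1)), mel_targets.reshape(-1)
    )
    # Logging the two terms separately is what lets you watch each curve go down on its own.
    return {"text_loss": text_loss, "mel_loss": mel_loss, "loss": text_loss + mel_loss}
```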
So it's still in progress; I still have to completely train the model. But that is kind of the gist of how I worked with the CLI agent in order to implement this fine-tuning loop. It might not even be called fine-tuning; this is probably just complete training from scratch, because what we did was reinstantiate all of the tokens inside of the new tokenizer. All of the old tokens are actually going to map to these new tokens, the new characters that are in Japanese, and we completely train the weights on an entirely new tokenizer corpus.
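As a sketch of what reinstantiating the tokens means in practice: swapping in a new tokenizer generally means replacing the GPT's text embedding table, and its text output head, with freshly initialized ones sized to the new vocabulary, then training those weights. The attribute names below (`text_embedding`, `text_head`) are hypothetical stand-ins, not the actual IndexTTS 2 module names.

```python
# Sketch: give an existing GPT-style TTS model a fresh text embedding table and output
# head sized for a new tokenizer's vocabulary. Attribute names are hypothetical.
import torch.nn as nn

def reinit_text_embeddings(model: nn.Module, new_vocab_size: int) -> nn.Module:
    hidden_dim = model.text_embedding.embedding_dim  # width of the old embedding table

    # Old token ids mean nothing under the new tokenizer, so we start over rather than
    # copying rows; every text embedding gets trained from scratch on the new corpus.
    model.text_embedding = nn.Embedding(new_vocab_size, hidden_dim)
    model.text_head = nn.Linear(hidden_dim, new_vocab_size, bias=False)

    nn.init.normal_(model.text_embedding.weight, mean=0.0, std=0.02)
    return model
```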
So I also ended up basing some of the prompts on this repository here. It definitely helps that someone had already implemented training, here it is, training for IndexTTS 1.5. This repository works for IndexTTS 1.5, and the pre-processing for the data is based off of it. It seems to be working for version two of the model, so I'm assuming the pre-processing didn't really change too much between version one and version two; the only things that really changed were the architecture and the addition of, I believe, the emotion and duration control. So this repository is not up to date with version two.
And then there was another repository
here um which was for version one um for
the repo. I didn't actually use this uh
one at all for reference uh to codeex
CLI on how to implement a training loop.
But yeah, these were the two inspirations for the pre-processing of the data and for even trying to train a little bit. Actually, some history: before the full fine-tuning run I had tried to implement a LoRA fine-tune, a parameter-efficient fine-tune, inside of here, the IndexTTS v2 trainer, but it wasn't working as I needed, because I think we need to train more of the attention heads inside of the model. So we even removed the LoRA stuff in here. I had tried training with LoRA and that didn't work; the LoRA training was actually from the previously deleted repository that I unfortunately lost. So I think it helped that I did in fact start from scratch here, to get rid of any old remnants of LoRA training, because I think that might have thrown off Codex in this implementation. But yeah, pretty much the summary is that the trainer here seems to be able to train for a new language fairly successfully, or at least it is successful in
creating outputs for the language. Uh I
I think I have like another one here.
Well, I've got Lune that I could try to inference on, but yeah, I guess I'll stop here and stop rambling. I just wanted to create a video talking about this because I've spent a lot of time, about a week, on it. And yeah, it's kind of cool that we're able to implement a full training loop with Codex CLI. And yeah,
like I said, I I don't have, you know,
previous experience with implementing
this from scratch uh, manually. So,
yeah, we'll see how this ends up going.
I'll probably update a little bit later.
Um, maybe release the weights for this
uh, to see um, uh, you know, if people
find the the weights useful. But other
than that, um, that's going to be pretty
much it for today's video. And, uh, once
again, like to thank all the members of
the channel for supporting me. Very much
appreciate it. And I will see you guys