I Made a Trainer for IndexTTS v2 For New Languages - Work in Progress
By Jarods Journey
Summary
## Key takeaways

- **First IndexTTS v2 Training Loop for New Languages**: The speaker claims to be the first to implement a training loop for IndexTTS v2, enabling fine-tuning for new languages like Japanese. This work is still in progress. [00:48]
- **AI-Assisted Development with Codex CLI**: The speaker utilized OpenAI's Codex CLI to implement the training loop, collaborating with the AI agent by providing context and iterating through prompts to develop the necessary scripts for data processing and model training. [01:01], [06:21]
- **Challenges in Data Pre-processing and Training**: The process involved significant challenges, including accidental data deletion, encountering segmentation faults when running scripts in WSL2 requiring workarounds in Windows, and lengthy data pre-processing that took a day for 450,000 samples. [00:30], [03:31]
- **Three Key Stages of Training Implementation**: The implementation of the training loop is broken down into three main stages: building a new tokenizer, pre-processing the data, and establishing the training loop itself, with continuous iteration to fix any issues. [11:08]
- **Focus on Text and Mel Loss for Improvement**: Key metrics for model improvement are identified as text loss, relating to text prediction accuracy, and mel loss, relating to sound prediction accuracy. Both are observed to be decreasing, indicating model progress. [11:31]
- **Inspiration from Previous Repositories**: The development was inspired by existing repositories for IndexTTS v1.5, particularly for data pre-processing, adapting techniques for the newer v2 architecture which includes emotion and duration control. [12:38]
Topics Covered
- Can TTS Models Replicate Emotion Across Languages?
- Real-World Hurdles: Training TTS Models from Scratch.
- AI Agents: Your Co-Pilot for Complex ML Training?
- Mastering AI Agents: The Iterative Loop of ML Development.
- New Languages: Why 'Fine-Tuning' is Really Training From Scratch.
Full Transcript
What's going on today, YouTube? Today we're going to be talking about IndexTTS 2. In my previous video, I showed you guys what IndexTTS 2 is, how it can be used for dubbing, and how you can get it installed. So I took a really deep dive over the last week, up until now, to actually implement a fine-tuning training loop for new languages, and I think I've got it. I've been hard at work on this for a week and had several difficulties with it, so I'm going to be talking about that in this video. I actually ended up accidentally deleting all my progress yesterday, because I unregistered WSL2 by accident. Well, I reimplemented it, and I have some results to talk about. So, very excited to show it off.
I don't think there's a training loop out there yet that's been implemented, so as far as I know I'm the first one that's done a v2 implementation, which is cool. That's actually a first for me; I've never implemented a training loop on my own. And to say "on my own" would be kind of misleading, because I've used Codex CLI to do this. But anyways, it's kind of cool that we're able to use AI to implement training loops for these open source text-to-speech models. A little bit of blabbering, but IndexTTS 2, if you don't know, is a model released by Bilibili, and it's a fantastic model that's able to replicate emotion in its audio.
So, I can just give an example. Um, so
we have this sample. This is from the video game Expedition 33, and the guy here is fairly angry. So,
there's some swear words in here. So,
I'm going to go ahead and play it. But
uh yeah, this is it.
>> Oh, [ __ ] the mission.
[ __ ] the mission, Lune.
>> So, as you can tell, he's pretty mad. Um
and then this is the fine-tuned version in Japanese here.
So there you go. Not bad in my opinion.
It transferred over the emotion, which is what we've seen with IndexTTS 2, and it's in Japanese, which is not supported by the base IndexTTS 2 model. The code and the implementation for it are actually inside of my GitHub repository here, the training v2 work, and it is still a work in progress. Currently I am still training Japanese; as you can see, we've got the training run going on here, and the loss is still going down. We like to see curves where the loss is continually going down. It looks like we're having a spike down there, and the text loss is down here, slowly improving as we continue training,
and so if you guys want to take a look at what was implemented here, I pushed it to GitHub. That also saves my progress, because I don't want to accidentally delete it again. So we can train a new tokenizer for the language and pre-process the data to train on, and then it is indeed possible to train another language from scratch with enough data.
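As a rough illustration of that tokenizer step, here's a minimal sketch of training a new SentencePiece BPE model on a Japanese text corpus. The file names, vocab size, and coverage setting are illustrative assumptions, not the exact settings used in my trainer.

```python
# Minimal sketch: train a new BPE tokenizer for Japanese with SentencePiece.
# Paths and hyperparameters are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="japanese_corpus.txt",   # one utterance transcript per line
    model_prefix="ja_bpe",         # writes ja_bpe.model and ja_bpe.vocab
    vocab_size=12000,              # must match the size of the model's new text embedding table
    model_type="bpe",
    character_coverage=0.9995,     # high coverage so rare kanji are not mapped to <unk>
)

# Quick smoke test of the resulting tokenizer.
sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
print(sp.encode("こんにちは、世界。", out_type=str))
```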
I have been training with the Emilia-YODAS dataset, which is, I believe, …00 hours of Japanese, and for all of the training done here I've actually only trained with half of that, because I literally just finished pre-processing all of the data; for some reason that took about a day to pre-process 450,000 samples. So I will be rerunning this training with the full dataset to see if that helps improve things a little bit. Right now it's still transferring files from Windows over to WSL, because for whatever reason this pre-processing script fails inside of WSL 2 with a segmentation fault. I don't know what's causing it, so I have to run it in Windows and then transfer the output over to WSL 2.
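To give a feel for what the pre-processing stage does per sample, here's a heavily simplified sketch, assuming pre-processing means turning each transcript into token ids and each clip into a log-mel spectrogram saved to .npy. The real IndexTTS 2 pipeline does more than this, and the mel parameters below are placeholders, not the actual settings.

```python
# Simplified sketch of one pre-processing pass: text -> token ids, audio -> log-mel spectrogram.
# Paths, sample rate, and mel parameters are illustrative assumptions.
import numpy as np
import sentencepiece as spm
import torch
import torchaudio

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=100
)

def preprocess_sample(wav_path: str, transcript: str, out_prefix: str) -> None:
    """Tokenize the transcript and extract a log-mel spectrogram, saving both to .npy files."""
    token_ids = np.array(sp.encode(transcript, out_type=int), dtype=np.int64)

    waveform, sr = torchaudio.load(wav_path)
    if sr != 22050:
        waveform = torchaudio.functional.resample(waveform, sr, 22050)
    mel = torch.log(mel_fn(waveform) + 1e-6).squeeze(0).numpy()  # shape (n_mels, frames)

    np.save(f"{out_prefix}_text.npy", token_ids)
    np.save(f"{out_prefix}_mel.npy", mel)
```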
But I mean, it's pretty much just going to be a video of me blabbering about what I've done, so there's nothing else too fantastic that I could show for inference in terms of audio or different samples. I have tested this in Japanese on other voices; let's change this to, say, Vivy and inference on it. I do think the duration handling is a little broken: this Vivy sample here is about 15 seconds, and the sample that I'm trying to output, the target, would be around 4 to 5 seconds, so we might get a bit of an issue with this inference. So the model isn't yet perfect, but for the intents and purposes of redubbing into Japanese, it seems like it's coming together, being able to adapt the emotion and keep roughly the same duration. So, here we go.
We've got it running here. Um and uh
this finished up, so we can take a
listen to it. It's 11 seconds. So, I
think there's going to be some
artifacting in here.
Yeah. So, this one has a bunch of
different artifacts. Um, it doesn't
actually say the sentence correctly, um,
which is here. It kind of said some of
it near the end, but I think we've got a bit of a mismatch between the duration of the input audio and what we want for the output audio. Also, the input audio is
Japanese. I don't know if that affects
it too much here, but um, yeah, I I've
kind of got to, you know, play around
with it a little bit more to see uh,
what happened there. And, um, well, I
guess I'll just go into how I did this
implementation and how I got a training
loop up and running for this. So, I'm no
machine learning expert or researcher or
anything like that. Um, I just dabble a
little bit in training these models. Um,
and so I don't really have all the
technical skills to implement this from
scratch. So what I used was Codex CLI, which is OpenAI's latest agent, in order to implement this. And I guess I can just go over some of the prompts and some of the ways that I did this with ChatGPT / Codex CLI.
So this is kind of how I started off. I prefaced it by saying that I've done this before, but I lost all of the data in a data deletion, which is actually something that happened. And then I said the goal is to figure out how to fine-tune IndexTTS 2 for Japanese: here's what we had done, and here's the steps we followed. So I outlined the steps that I want to go through with the model, to give it all of the context that it needs to help me build a fine-tuning training loop for
this. And so I give it the IndexTTS papers, versions one and two. I give it this IndexTTS LoRA repo, which is some work from another guy; I'll show you guys real quick. And then Amphion's MaskGCT, which I believe is used to process the audio samples.
And so yeah, basically my idea is to collect everything I think the model would need in order to implement the training loop, give it all of that context, and then work through it with the agent step by step: figure out how to process the data; pre-process the data into, you know, numpy arrays or however it needs to be pre-processed; tokenize the text to make sure it can handle the language; see if we can fine-tune the GPT, the predicting model, to output the correct tokens that we need for Japanese; and, oh, figure out how to extend the text embeddings of the original GPT model so that it can take a different tokenizer. That's kind of the high-level overview of how I think about this. So yeah, I start off with that, I collect all the resources that I need to give Codex CLI the context around what I want to do, and then we basically just go back and forth, with it asking questions and
me answering them. So what I like to do with agents is to say something along the lines of: before we start doing any coding, please start with clarifying questions about the project, as I want to make sure we're on the same page. I love to do this with any agent I'm using, because I want to make sure that what it thinks I want actually aligns with what I want, and to do that, I make sure we're on the same page. So, yeah, I answer all the questions here, and then I basically do it again and answer all the questions it asked for clarification. And then the next step that we end up getting to is this:
Okay, so here what I wanted to do was to test the tokenizer. I've had issues with training text models in the past where the tokenizer doesn't actually work. The tokenizer is what turns the text, the words, into numbers that the model can understand, the numbers its prediction weights are tied to. We check it to make sure that there are no unknown characters; that's what these tokens are. We should not be getting any unk tokens, because when we're using the tokenizer, we don't want any unknown characters in the Japanese that we're trying to inference on. So I want to test the tokenizer first.
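The check itself is simple. Here's roughly what it looks like with a SentencePiece tokenizer, assuming a held-out list of Japanese transcripts; the file names are illustrative.

```python
# Sketch of the tokenizer check: encode Japanese transcripts and flag any <unk> tokens.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
unk_id = sp.unk_id()

bad = 0
with open("japanese_val_transcripts.txt", encoding="utf-8") as f:
    for line in f:
        ids = sp.encode(line.strip(), out_type=int)
        if unk_id in ids:
            bad += 1
            print("contains <unk>:", line.strip())

print(f"{bad} transcripts produced unknown tokens")  # we want this to be 0
```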
Got that finished up. So these next few messages are verifying that the tokenizer works; here I've got something like "run verification please" again. And then the next stage would be to pre-process the data, and I worked to prompt it to build the pre-processing loop. So, here we go.
have this um
and then yeah, just some back and forth
between uh myself and the agent here.
And then, you know, I was running into
some crashes. And so, I ended up moving
it to Windows because um of some stuff
that I found online. And then, yeah, so
that's the pre-processing. And then it
also built the um you know, the training
loop. And so here I'm reiterating to,
you know, change how it saves and
everything like that. And then yeah,
it's broken down for the most part into
three stages. Um, building a new
tokenizer, pre-processing the data, and
then the training loop. And so if the
training loop had any issues with it,
then I would reiterate with the agent to
try to fix it to see if we could get the
um model training for what we needed to.
So there were two things that I identified that we needed, or that would be good to focus on, for loss: text loss and mel loss. Text loss, I believe, is how accurately it's predicting the text, and mel loss is how accurately it's predicting the sound, though I would have to double check and verify. But as both of these go down, the model seems to be getting better and better at predicting Japanese.
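To make that concrete, here's a minimal sketch of how two cross-entropy terms like these are typically combined in an autoregressive TTS trainer. The tensor names, shapes, and the unweighted sum are assumptions for illustration, not necessarily how the IndexTTS 2 GPT computes its losses.

```python
# Sketch: combining a text-token loss and a mel/acoustic-token loss during training.
# Shapes, names, and the simple unweighted sum are illustrative assumptions.
import torch
import torch.nn.functional as F

def combined_loss(
    text_logits: torch.Tensor,   # (batch, text_len, text_vocab)
    text_targets: torch.Tensor,  # (batch, text_len)
    mel_logits: torch.Tensor,    # (batch, mel_len, mel_vocab)
    mel_targets: torch.Tensor,   # (batch, mel_len)
) -> dict:
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1)
    )
    mel_loss = F.cross_entropy(
        mel_logits.reshape(-1, mel_logits.size(-1)), mel_targets.reshape(-1)
    )
    # Logging the two terms separately is what lets you watch each curve go down on its own.
    return {"text_loss": text_loss, "mel_loss": mel_loss, "loss": text_loss + mel_loss}
```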
So it's still in progress; I still have to completely train the model. But that is kind of the gist of how I worked with the CLI agent in order to implement this fine-tuning loop. It might not even be called fine-tuning; this is probably just complete training from scratch, because what we did was reinstantiate all of the tokens inside of the new tokenizer. All of the old tokens are actually going to map to these new tokens, the new characters that are in Japanese, and we completely train the weights on an entirely new tokenizer corpus.
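As a sketch of what reinstantiating the tokens means in practice: swapping in a new tokenizer generally means replacing the GPT's text embedding table, and its text output head, with freshly initialized ones sized to the new vocabulary, then training those weights. The attribute names below (`text_embedding`, `text_head`) are hypothetical stand-ins, not the actual IndexTTS 2 module names.

```python
# Sketch: give an existing GPT-style TTS model a fresh text embedding table and output
# head sized for a new tokenizer's vocabulary. Attribute names are hypothetical.
import torch.nn as nn

def reinit_text_embeddings(model: nn.Module, new_vocab_size: int) -> nn.Module:
    hidden_dim = model.text_embedding.embedding_dim  # width of the old embedding table

    # Old token ids mean nothing under the new tokenizer, so we start over rather than
    # copying rows; every text embedding gets trained from scratch on the new corpus.
    model.text_embedding = nn.Embedding(new_vocab_size, hidden_dim)
    model.text_head = nn.Linear(hidden_dim, new_vocab_size, bias=False)

    nn.init.normal_(model.text_embedding.weight, mean=0.0, std=0.02)
    return model
```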
So I also ended up basing some of the prompts on this repository here. It definitely helps that someone had already implemented training, here it is, training for IndexTTS 1.5. This repository works for IndexTTS 1.5, and the pre-processing for the data is based off of it. It seems to be working for version two of the model, so I'm assuming the pre-processing didn't really change too much between version one and version two; the only things that really changed were the architecture and the addition of, I believe, the emotion and duration control. So this repository is not up to date with version two.
And then there was another repository
here um which was for version one um for
the repo. I didn't actually use this uh
one at all for reference uh to codeex
CLI on how to implement a training loop.
But yeah, these were the two inspirations for the pre-processing of the data and for even trying to train a little bit. Actually, some history: before the full fine-tuning run I had tried to implement a LoRA fine-tune, a parameter-efficient fine-tune, inside of here, the IndexTTS v2 trainer, but it wasn't working as I needed, because I think we need to train more of the attention heads inside of the model. So we even removed the LoRA stuff in here. I had tried training with LoRA and that didn't work; the LoRA training was actually from the previously deleted repository that I unfortunately lost. So I think it helped that I did in fact start from scratch here, to get rid of any old remnants of LoRA training, because I think that might have thrown off Codex in this implementation. But yeah, pretty much the summary is that the trainer here seems to be able to train for a new language fairly successfully, or at least it is successful in
creating outputs for the language. Uh I
I think I have like another one here.
Well, I've got Lune that I could try to inference on, but yeah, I guess I'll stop here and stop rambling. I just wanted to create a video talking about this because I've spent a lot of time, about a week, on it. And yeah, it's kind of cool that we're able to implement a full training loop with Codex CLI. And yeah,
like I said, I I don't have, you know,
previous experience with implementing
this from scratch uh, manually. So,
yeah, we'll see how this ends up going.
I'll probably update a little bit later.
Um, maybe release the weights for this
uh, to see um, uh, you know, if people
find the the weights useful. But other
than that, um, that's going to be pretty
much it for today's video. And, uh, once
again, like to thank all the members of
the channel for supporting me. Very much
appreciate it. And I will see you guys