Pcedev blog - Hands on Sinsy, a free software solution for song vocal synthesis


You may have already used or considered song vocal synthesis for your musical compositions, with software like Vocaloid, CeVIO, UTAU or Alter/Ego.

However, when you care about open source software, free (as in free speech and free beer) software or native support in libre OSes, you don't have many choices left. There is e-cantorix, but as of today it's still lacking in aesthetic qualities.

Being a Linux user myself and wanting some placeholder vocals on my compositions, I was a bit annoyed by the lack of choices to synthesize song vocals.

Then I stumbled on Sinsy, an (unfairly) little known free (as in free beer and free speech) open source vocal synthesizer which gives quite good results. So that you can give Sinsy a try, I'll quickly guide you through the steps of creating synthetic vocals with it.

A quick teaser

I'm writing and writing and you still haven't heard any sample, have you?

So, here's a quick sample of what Sinsy can generate.

Score extract fed to **Sinsy**.
Lyrics by Leyra Kagiyama, melody by Kirin&ZUN

Quite nice, isn't it?

This sample was part of a longer remix I did:

Now, let's dig into what you have to do to achieve similar results.

Sinsy in a nutshell

Sinsy is an open source vocal synthesizer licensed under a modified BSD license. It has a page on SourceForge: http://sinsy.sourceforge.net and also comes with a web service demo page: http://sinsy.jp

The web service demo page is served by a much newer version than the one on SourceForge. The Sinsy developers told us that the public version will be gradually updated.

Technologically speaking, Sinsy uses HMMs (Hidden Markov Models) to learn from actual human vocals and generate synthetic vocals fitting any given score. Unlike UTAU for instance, the learning phase is not really accessible to feeble humans and is best left to the research team for now.

Lyrics

First, you'll need lyrics. In my case, I was interested in Japanese lyrics, as you have noticed, but fear not, the Sinsy web service can also handle English and Mandarin.

Prepare your lyrics in any text editor, with syllables separated by spaces (more on this later), and keep them for later.

If you want Japanese lyrics, you can copy/paste them if they already exist, type them if your system is configured for Japanese input, or rely on an external service like this online Japanese keyboard (among others).

One important step when preparing lyrics is that they need to be "as heard". For instance, は in こんにちは is pronounced wa despite being written as ha, so you'll write your lyrics こんにちわ.
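As a rough illustration of the "syllables separated by spaces" preparation, here is a minimal Python sketch that splits kana lyrics into space-separated units. It assumes one kana per note, with small kana (ゃ, ゅ, ょ and friends) glued to the preceding character. The function name `split_kana` is made up for this example, and the sketch does not do the "as heard" rewriting for you: that step stays manual.

```python
# Small kana that combine with the previous character (e.g. き + ょ -> きょ).
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_kana(text):
    """Split kana lyrics into units, one unit per note (assumption)."""
    units = []
    for char in text:
        if char.isspace():
            continue  # ignore existing spaces and line breaks
        if char in SMALL_KANA and units:
            units[-1] += char  # attach the small kana to the previous unit
        else:
            units.append(char)
    return units


if __name__ == "__main__":
    # Lyrics are already written "as heard" (わ instead of は).
    print(" ".join(split_kana("こんにちわ")))
```

Whether ん, っ or a long vowel gets its own note depends on your melody, so review the result by hand before pasting it into your score editor.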

Melody

Sinsy requires a lyrics-annotated MusicXML score as input. In theory, any editor able to produce one will do, but I'll show the steps with a libre editor.

Let's install MuseScore. It may be packaged for your Linux distribution, or you can download it from its homepage. If you don't know it already, there is a Quick starter section in its documentation. Basically, you need to know that you add notes with n then c, d, e, f, g, a, b, quit note entry with ESC, create slurs with s, create ties with + and enter lyrics with Ctrl + l.

At this point, you can either enter notes manually (via keyboard shortcuts or mouse) or import a MIDI file (from your DAW for instance).

You want to make sure that you only use voice #1 as it will be the only one considered by Sinsy.

Furthermore, you need to leave silence before and after your melody (typically one bar as prefix and one bar as suffix). This is because the voice synthesis may choose to start the audio a bit earlier than your actual first note to create a natural sound (same story for the last note).
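The two constraints above (voice #1 only, a silent bar at each end) can be checked automatically once the score is exported. Here is a minimal sketch using only the Python standard library. It assumes an uncompressed, single-part MusicXML export, and the `check_score` function and the tiny `SAMPLE` score are made up for this example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written score used only for the demo (not a real export).
SAMPLE = """<score-partwise version="3.1">
  <part id="P1">
    <measure number="1">
      <note><rest measure="yes"/><duration>4</duration><voice>1</voice></note>
    </measure>
    <measure number="2">
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration><voice>1</voice>
        <lyric><text>ら</text></lyric>
      </note>
    </measure>
    <measure number="3">
      <note><rest measure="yes"/><duration>4</duration><voice>1</voice></note>
    </measure>
  </part>
</score-partwise>"""


def check_score(xml_text):
    """Return a list of human-readable problems found in the score."""
    root = ET.fromstring(xml_text)
    part = root.find("part")  # assumption: only the first part matters
    if part is None:
        return ["no <part> element found"]
    measures = part.findall("measure")
    if not measures:
        return ["no <measure> element found"]

    problems = []
    # Every <voice> element should contain "1".
    extra_voices = {voice.text for voice in part.iter("voice")} - {"1"}
    if extra_voices:
        problems.append(f"notes found in voice(s) {sorted(extra_voices)}")
    # The first and last measures should contain nothing but rests.
    for label, measure in (("first", measures[0]), ("last", measures[-1])):
        if any(note.find("rest") is None for note in measure.findall("note")):
            problems.append(f"{label} measure is not silent")
    return problems


if __name__ == "__main__":
    print(check_score(SAMPLE) or "score looks fine")
```

To check a real file, read its content first, for instance with `open(path, encoding="utf-8").read()`. Compressed `.mxl` files are not handled by this sketch.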

Mashing lyrics with melody

Now we can add lyrics to your melody. Copy your space-separated lyrics to your clipboard.

Then select the first note to which you want to add lyrics and with Ctrl + l, you enter lyrics mode.

You can now paste your lyrics. MuseScore only consumes your clipboard up to the first space, which means that you can hammer Ctrl + V to fill all subsequent notes with the rest of your clipboard content.

Please note that when you have a tie (joining 2 notes of the same pitch to sum their lengths), you should only have lyrics on the very first one (Sinsy would repeat the consonant otherwise).
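Stray lyrics on tied notes are easy to miss by eye. The sketch below lists the measures where a note that ends a tie still carries a lyric. It relies on the standard MusicXML `<tie type="stop"/>` and `<lyric>` elements and assumes an uncompressed export. The function name and the sample score are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written demo score: the tied-to note in measure 2 wrongly has a lyric.
SAMPLE = """<score-partwise version="3.1">
  <part id="P1">
    <measure number="1">
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration><tie type="start"/><voice>1</voice>
        <lyric><text>ら</text></lyric>
      </note>
    </measure>
    <measure number="2">
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration><tie type="stop"/><voice>1</voice>
        <lyric><text>ら</text></lyric>
      </note>
    </measure>
  </part>
</score-partwise>"""


def lyrics_on_tied_notes(xml_text):
    """Return the numbers of measures holding a tied-to note with a lyric."""
    root = ET.fromstring(xml_text)
    offenders = []
    for measure in root.iter("measure"):
        for note in measure.findall("note"):
            # A note that continues or ends a tie has <tie type="stop"/>.
            ends_tie = any(tie.get("type") == "stop" for tie in note.findall("tie"))
            if ends_tie and note.find("lyric") is not None:
                offenders.append(measure.get("number"))
    return offenders


if __name__ == "__main__":
    print(lyrics_on_tied_notes(SAMPLE))
```

An empty list means every tie only has lyrics on its first note.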

At this point, your score should look like the example above:

You can now export your score in MusicXML format via the File > Export menu entry.

Feed your score to the beast

Your MusicXML can be sent to the web service manually by going to http://sinsy.jp, uploading your MusicXML score and selecting the language, voice, gender parameter, vibrato power and pitch you want.
Sinsy will compute your .wav and display it on page refresh.

You can also use a small helper script I've coded for this very purpose: sinsy-cli. It's a Python script which accepts one or more MusicXML scores, uploads them while trying not to overload the Sinsy server, and downloads the resulting .wav files.

You can install sinsy-cli by typing:

pip install sinsy-cli

Its homepage is https://gitlab.com/zeograd/sinsy-cli. It's a very simple CLI with sensible defaults, and you're clever enough to figure out how to use it properly.

Final notes

Now that you have your wav files out of your lyrics and melody, you "only" have to feed them to your favorite DAW, postprocess them a bit and fame is yours :)

Ah, and don't forget to credit Sinsy if you publish your work, it's part of the web service terms of use.

You may notice that the initial silence in the resulting wav is not always consistent. It will very often match the silent bar you included in your score, but sometimes it can be off by a couple of beats in either direction. YMMV, so don't be surprised if you have to manually resync some sentences.
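To spot the files that need resyncing before loading them in your DAW, you can measure how long each wav stays silent at the start. Here is a minimal sketch with the standard `wave` module. It assumes 16-bit PCM audio, and the function name and the amplitude `threshold` of 500 are arbitrary choices for this example that you may need to tune.

```python
import array
import sys
import wave


def leading_silence_seconds(path, threshold=500):
    """Return how many seconds pass before the first sample above threshold."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # wav data is little-endian
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Samples are interleaved: convert the sample index to a frame index.
            return (index // channels) / rate
    return len(samples) / channels / rate  # the whole file is silent


if __name__ == "__main__":
    for wav_path in sys.argv[1:]:
        print(wav_path, round(leading_silence_seconds(wav_path), 3))
```

Comparing the value with the duration of your silent bar (4 beats at your tempo for a 4/4 score) tells you how far to shift the clip.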

I haven't discussed more advanced uses of Sinsy (slurs, breaths, ...) but this introductory post is already long enough. Please comment if you want a follow-up.

Until then, go make some (lovely) noises :)