ElevenLabs is one of the best TTS (Text to Speech) tools on the market. Their AI voices and voice cloning are incredibly impressive and almost indistinguishable from human voices.
However, while their quality is certainly a step above most of their competitors, you might be wondering if there are any alternatives out there that can rival them in price and quality.
I’ve done extensive testing of a variety of Text to Speech services and Python packages for my own personal use and in this blog post I’ll share my findings with you to give you an idea of which service is best for you.
There are some free tools as well as some paid tools in this list, but the free tools are often more difficult to set up than the paid tools.
To make the comparisons fairer and to make it easier for you to see which one you like most, I have included samples from every service in this blog post. Make sure to give them a listen, because it’s the best way to figure out which text to speech sounds most natural to you. For reference, here is a sample from ElevenLabs. Listen to this first, and then compare it with the 10 alternatives below!
With that being said, let’s take a look at the top 10 ElevenLabs alternatives!
10 ElevenLabs Text to Speech & Voice Cloning Alternatives
1. PlayHT
Play.ht is a service that offers “A new generation of voices almost indistinguishable from a human voice.”
Like ElevenLabs, PlayHT also offers voice cloning.
However, while they claim that their new generation of voices is almost indistinguishable from a human voice, I do think that ElevenLabs is slightly better at mimicking human speech than PlayHT. You can judge that for yourself by comparing the sample below to the ElevenLabs sample above.
The reason I put PlayHT so high on the list is that their voice cloning and AI voices are definitely not bad, and they offer “unlimited” voice generation and downloads in their premium package, which currently costs only $50 a month if billed yearly.
This “unlimited” package is restricted by a fair usage policy of up to 770 hours of audio, but that’s still a good deal and much cheaper than most alternatives, including ElevenLabs, which charges $330 for 40 hours.
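If you want to see how far apart those two prices really are, here is a quick back-of-the-envelope sketch in Python. It only uses the figures quoted above and assumes you use the full allowance of each plan, which most people won’t. The function and variable names are made up for this illustration.

```python
# Figures quoted in this post; plan prices change often, so treat them as a snapshot.
PLAYHT_MONTHLY_USD = 50       # premium plan, billed yearly
PLAYHT_FAIR_USE_HOURS = 770   # fair usage cap on the "unlimited" plan
ELEVENLABS_PLAN_USD = 330
ELEVENLABS_PLAN_HOURS = 40


def cost_per_hour(price_usd: float, hours: float) -> float:
    """Price per hour of audio, assuming the whole allowance is used."""
    return price_usd / hours


playht = cost_per_hour(PLAYHT_MONTHLY_USD, PLAYHT_FAIR_USE_HOURS)
elevenlabs = cost_per_hour(ELEVENLABS_PLAN_USD, ELEVENLABS_PLAN_HOURS)

print(f"PlayHT: ${playht:.2f}/hour, ElevenLabs: ${elevenlabs:.2f}/hour")
# PlayHT: $0.06/hour, ElevenLabs: $8.25/hour
```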
If you’re willing to compromise on quality and need quantity, then Play.ht might be worth checking out.
They also offer an API, which makes them a decent choice for developers.
Check out a sample of Play.ht’s Text to Speech here:
2. Resemble.ai
Resemble.ai is a text to speech platform that offers voice cloning as well as a marketplace of existing voices that were sourced from a variety of voice actors.
They have a decently generous free trial available that gives you 300 seconds of free speech synthesis.
If you need more than that, they have paid plans that are pay-as-you-go and cost $0.006 per second of synthesized audio.
Compared to ElevenLabs, which charges $330 a month for 40 hours in their largest plan, this is quite expensive. 40 hours of synthesized audio comes to $864 if you use Resemble.ai! You also do not get access to their API unless you’re subscribed to the Pro plan, but that is mostly for enterprises.
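Here is the math behind that $864 figure as a small Python sketch, using only the per-second rate quoted above. The second function turns the question around and shows how much Resemble.ai audio the ElevenLabs budget would buy. The names are hypothetical and only there for the example.

```python
SECONDS_PER_HOUR = 3600
RESEMBLE_USD_PER_SECOND = 0.006  # pay-as-you-go rate quoted in this post


def resemble_cost(hours: float) -> float:
    """Pay-as-you-go cost in USD for a given number of hours of audio."""
    return hours * SECONDS_PER_HOUR * RESEMBLE_USD_PER_SECOND


def hours_for_budget(budget_usd: float) -> float:
    """How many hours of audio a budget buys at the per-second rate."""
    return budget_usd / RESEMBLE_USD_PER_SECOND / SECONDS_PER_HOUR


print(f"40 hours on Resemble.ai: ${resemble_cost(40):.2f}")
print(f"$330 buys about {hours_for_budget(330):.1f} hours on Resemble.ai")
# 40 hours on Resemble.ai: $864.00
# $330 buys about 15.3 hours on Resemble.ai
```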
However, their voice cloning does work pretty well and their website is easy to use.
3. Descript
Descript is an AI-powered software that enables users to edit and produce audio and video content with exceptional ease. It utilizes advanced natural language processing (NLP) and machine learning algorithms to provide innovative features for transcription, editing, and media creation.
At its core, Descript offers a highly accurate and efficient transcription service. Users can upload audio or video files, and the software automatically transcribes the content, creating a text-based representation of the spoken words. This transcription can be easily edited through a user-friendly interface that resembles a word processor, allowing for simple text manipulation, such as deleting, rearranging, or modifying the text.
One of the standout features of Descript is its ability to generate realistic voiceovers. Using a technique called Overdub, users can create custom voice recordings by training the software with their own recorded voice. Once trained, Descript can generate new voiceovers that sound remarkably similar to the original voice, making it a valuable tool for podcasters, voiceover artists, or anyone seeking to modify or enhance recorded content.
Descript also includes powerful collaboration capabilities, enabling multiple users to work on the same project simultaneously. Edits made by one user are instantly synced and visible to others, facilitating seamless teamwork and enhancing productivity.
Furthermore, Descript provides a range of additional editing features, such as the ability to remove filler words, add captions, apply transitions, and even edit audio by manipulating the text transcript. This innovative approach simplifies the editing process, allowing users to make changes directly in the text and have them automatically reflected in the corresponding audio or video.
Compared to ElevenLabs, Descript has pretty decent Text to Speech. However, while ElevenLabs is focused completely on Text to Speech, in Descript it’s merely one part of their overall package.
What is nice, though, is that Descript gives you unlimited access to Overdub if you sign up for their Pro plan, which costs only $24 a month, making it quite a good deal.
One thing that I didn’t like very much is that while Descript does have an API, it is in closed beta and seems to only be available to enterprise customers. They require you to fill in a form to get access to it.
In addition, with Descript’s Overdub you can only clone your own voice. This is enforced by having you read a piece of text out loud in your own voice. So if you were planning to use Descript to clone celebrity voices, I’ll unfortunately have to disappoint you.
Here is a sample of Descript’s Text to Speech:
4. Coqui.ai
Coqui.ai is a text to speech software that’s quite similar to ElevenLabs. Like ElevenLabs, they offer a couple of existing voices, but you can also create your own voices and clone voices.
Compared to ElevenLabs, in my personal experience the result was a bit worse, but they do have a much more advanced editor that allows you to fine-tune the voices to a large extent.
They offer a free trial which gives you 30 minutes of synthesis time. If you need more than that, 4 hours of synthesized audio costs $20, or if you’re a real power user you can get 50 hours of synthesized audio for $175.
Compared to ElevenLabs, Coqui’s pricing is cheaper: 50 hours for $175 works out to $3.50 per hour, while ElevenLabs’ 40 hours for $330 comes to $8.25 per hour. In addition, with ElevenLabs you pay monthly and the characters you don’t use are lost, whereas Coqui.ai lets you pay for a certain number of hours that are usable forever.
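The fact that unused ElevenLabs characters expire matters more than it might seem. The rough Python sketch below shows what an hour of audio effectively costs when you only use part of a monthly plan, compared to a pack of hours that never expires. It uses the prices quoted in this post; the function names are hypothetical.

```python
def subscription_cost_per_used_hour(monthly_usd: float, hours_used: float) -> float:
    """Monthly plan: you pay the full price no matter how much you use."""
    return monthly_usd / hours_used


def credit_cost_per_hour(pack_usd: float, pack_hours: float) -> float:
    """Non-expiring pack: unused hours carry over, so the rate stays fixed."""
    return pack_usd / pack_hours


coqui_small = credit_cost_per_hour(20, 4)     # 4 hours for $20
coqui_large = credit_cost_per_hour(175, 50)   # 50 hours for $175

# ElevenLabs' $330 plan covers up to 40 hours a month.
for used in (10, 20, 40):
    eleven = subscription_cost_per_used_hour(330, used)
    print(f"{used} h used: ElevenLabs ${eleven:.2f}/h vs Coqui ${coqui_large:.2f}/h")
```

The less of the monthly allowance you use, the wider the gap gets, so light users benefit most from hours that don’t expire.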
Like ElevenLabs, Coqui.ai offers an API, but only for paid users. Free users will have to use the website.
Here is a sample of Coqui.ai:
5. Murf.ai
Murf is a Text to Speech platform that offers voice cloning, voice over video and a voice changer.
They have a free plan that allows you to try out their voices, but their free plan does not allow you to download anything you create, which makes it a bit useless.
If you want to download your creations, you’ll have to sign up for their Basic or Pro plan. These cost $19 and $26 a month respectively (billed yearly).
The Basic plan gives you unlimited downloads, but only 24 hours of voice generation per year. It also does not give you access to their AI voice changer.
Their Pro plan gives you 48 hours of voice generation per year, access to all languages and accents, and access to their AI voice changer.
Overall, their Text to Speech is not bad and might be worth it for some people. To see if it’s for you, check out the voice sample below:
6. Synthesys
Synthesys.io is a platform that offers a variety of services, one of which is their Text to Speech service.
Like ElevenLabs, Synthesys.io offers voice cloning and an array of prebuilt voices.
For $27 a month, this platform offers unlimited voice-overs and access to 38 real human voices as well as 374 computer voices in 140 different languages.
However, while they claim that they offer “unlimited” voiceovers, the real limit is set at 120 minutes per day. If you use the full 120 minutes per day, you’d get a maximum of 60 hours per month, which is still a pretty good price compared to ElevenLabs’ $330 for 40 hours.
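For anyone who wants to double-check that, this is how the daily cap translates into a monthly figure and a price per hour. It’s a small sketch that assumes a 30-day month and that you hit the cap every single day; the constant names are my own.

```python
DAILY_CAP_MINUTES = 120      # real limit behind the "unlimited" plan
DAYS_PER_MONTH = 30          # assumption: a 30-day month
SYNTHESYS_MONTHLY_USD = 27

monthly_cap_hours = DAILY_CAP_MINUTES * DAYS_PER_MONTH / 60
synthesys_usd_per_hour = SYNTHESYS_MONTHLY_USD / monthly_cap_hours
elevenlabs_usd_per_hour = 330 / 40   # ElevenLabs' largest plan, as quoted above

print(f"Synthesys cap: {monthly_cap_hours:.0f} hours per month")
print(f"Synthesys: ${synthesys_usd_per_hour:.2f}/hour, "
      f"ElevenLabs: ${elevenlabs_usd_per_hour:.2f}/hour")
# Synthesys cap: 60 hours per month
# Synthesys: $0.45/hour, ElevenLabs: $8.25/hour
```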
Unfortunately, Synthesys does not offer an API for their regular customers. Instead, everything goes through their web interface. For enterprise customers, they do offer an API though.
Here is a voice sample from Synthesys:
7. Tortoise TTS
Tortoise-TTS is a Python library that many Text to Speech voice cloners are based on.
The library is completely open source and if you have a powerful enough PC, you can run it locally. However, when I ran it on my PC, it took a long time to synthesize the audio, and that’s with an RTX 3080! The quality was also not nearly as good as that of ElevenLabs.
There is also a fork called Tortoise-TTS-fast, which, while faster, still takes a long time to synthesize audio.
A word of warning though: if you don’t know anything about Python and programming, this is quite tough to set up. You’ll likely have to spend many hours and in the end you might not be very happy with the results. You’ll also likely end up running into compatibility issues that can be tough to fix if you don’t have a lot of experience.
The benefits are that it’s free to run and that it gives you a lot of freedom to experiment if you know how to code, but even after spending hours to get it running locally, I still went back to ElevenLabs just because it had better text to speech and much faster speech synthesis.
Here is a sample that’s created with Tortoise-TTS.
8. Microsoft Azure Neural Voices
Microsoft’s Azure Text to Speech is a powerful and versatile cloud-based service that converts written text into natural-sounding speech. With Azure Text to Speech, developers can integrate lifelike speech capabilities into their applications, products, or services, enhancing the user experience and enabling accessibility for individuals with visual impairments.
Azure Text to Speech boasts an extensive selection of high-quality voices, including different languages, accents, and expressive styles. This enables developers to create personalized and localized experiences for their users across various regions and demographics. The service leverages cutting-edge neural text-to-speech technology to deliver human-like intonation, pronunciation, and emphasis, resulting in a more engaging and natural auditory output.
However, while the voices from Azure are good enough for many use-cases, if you want a neural voice that sounds almost identical to that of a human, Azure might not be your best bet. The voices do sound a bit robotic at times.
One thing that Azure does do well is their pricing. They have a generous free allowance of 500,000 characters per month, which is plenty for everyone but the most hardcore users. If you need more than 500,000 characters, you’re charged per character at a rate of $16 per 1 million characters. If we compare this to ElevenLabs, which charges $165 for 1 million characters in their largest package, we can see that Azure’s Text to Speech is a much more budget-friendly option. However, this cheaper price does come with a few drawbacks. Their voices are markedly less lifelike and Azure does not offer voice cloning.
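To get a feel for what that means for your own monthly bill, here is a rough Python sketch based on the numbers in this post. It assumes the free allowance is simply subtracted before billing, and it applies the ElevenLabs rate linearly even though ElevenLabs actually sells fixed plans, so treat the ElevenLabs side as an approximation. All names are made up for the example.

```python
AZURE_FREE_CHARS = 500_000          # free characters per month
AZURE_USD_PER_MILLION = 16          # rate above the free allowance
ELEVENLABS_USD_PER_MILLION = 165    # largest ElevenLabs package, as quoted here


def azure_monthly_cost(characters: int) -> float:
    """Azure bill in USD, assuming only characters above the allowance are billed."""
    billable = max(0, characters - AZURE_FREE_CHARS)
    return billable * AZURE_USD_PER_MILLION / 1_000_000


def elevenlabs_equivalent_cost(characters: int) -> float:
    """Approximation: ElevenLabs' per-million rate applied linearly."""
    return characters * ELEVENLABS_USD_PER_MILLION / 1_000_000


for chars in (400_000, 1_000_000, 2_000_000):
    print(f"{chars:>9,} chars: Azure ${azure_monthly_cost(chars):.2f}, "
          f"ElevenLabs about ${elevenlabs_equivalent_cost(chars):.2f}")
```

Anything under 500,000 characters a month comes out at $0 on Azure, which is why the free tier is enough for most hobby projects.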
Here’s a sample of one of Azure’s AI voices:
9. Speechify
While Speechify does not yet offer voice cloning, they do offer a variety of included celebrity voices such as Snoop Dogg, Barack Obama, and Gwyneth Paltrow.
The reason I have them listed as an ElevenLabs alternative, despite the fact that they do not offer voice cloning, is that the voices they do have available are decent.
However, the drawback is that Speechify is more focused on reading web content to you through a Chrome extension, and their celebrity voices do not seem to be available in their voiceover app, which is a shame. Their AI voices are not too bad, but they’re not as good as those of ElevenLabs in my opinion. In fact, I’d even say that the cloned voices from ElevenLabs sound better than the celebrity voices from Speechify.
Nevertheless, Speechify is available for free, and is a decent option for people who are looking for their web content to be read to them in the voice of Obama!
10. Uberduck
Uberduck is another Text to Speech tool that offers voice cloning. Aside from voice cloning, they also focus on allowing you to create music with AI vocals.
While they’re pretty decent, in my opinion, Uberduck is not worth the price. Uberduck’s cheapest plan (which doesn’t even include voice cloning) costs $96 a year ($8 a month) and gives you access to 3600 seconds of voice synthesis per month. For that price, I think you’re better off choosing the ElevenLabs starter plan which gives you 30,000 characters for only $5.
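Uberduck prices by seconds and ElevenLabs by characters, so the two plans are hard to compare directly. The sketch below is my own rough estimate, not an official conversion: it derives a characters-per-hour ratio from the ElevenLabs figures quoted elsewhere in this post ($330 for 40 hours, $165 per 1 million characters) and assumes that ratio also holds for the starter plan.

```python
# ElevenLabs figures quoted in this post.
ELEVENLABS_TOP_PLAN_USD = 330
ELEVENLABS_TOP_PLAN_HOURS = 40
ELEVENLABS_USD_PER_MILLION_CHARS = 165

# Implied size of the top plan in characters, and characters per hour of audio.
top_plan_chars = ELEVENLABS_TOP_PLAN_USD / ELEVENLABS_USD_PER_MILLION_CHARS * 1_000_000
chars_per_hour = top_plan_chars / ELEVENLABS_TOP_PLAN_HOURS

# Starter plan: 30,000 characters for $5 (assumes the same ratio applies).
starter_minutes = 30_000 / chars_per_hour * 60
# Uberduck's cheapest plan: 3600 seconds for $8 a month.
uberduck_minutes = 3600 / 60

print(f"ElevenLabs starter: about {starter_minutes:.0f} minutes for $5")
print(f"Uberduck cheapest plan: {uberduck_minutes:.0f} minutes for $8")
```

By this estimate the amount of audio you get per dollar is in the same ballpark for both, so the decision mostly comes down to which voices you think sound better.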
Here is a sample of Morgan Freeman’s voice created with Uberduck:
Conclusion
So, there you have it, 10 alternatives to ElevenLabs! I hope that this comparison was helpful and that it aided you in finding the best tool for your needs. Which text to speech do you think sounds most realistic?
Personally, I’m still a big fan of ElevenLabs, though I must say that Descript also impressed me quite a bit, despite a few drawbacks.
Do you know any other alternatives? If so, I’d love to hear from you in the comments. Things in AI move fast, and new products are released every day. As such, I’ll try to keep this post updated with the latest developments, but I’d much appreciate your help with that.