As someone who consumes a lot of both HAL-9000-style and produced audio, I would say they diverge into very different experiences. The former, especially if you settle on a robot voice, is almost like "reading", in that you can learn to apply personal imagination over "neutral" audio. It can be more active. The latter is more like experiencing: the producer influences how you think about the work based on voice casting / performance. It can be more passive. The caveat being I think more writers in the future will hop on AI voice / audio production once the technology matures, and you'll have creators personally direct "definitive" editions of how their audio is supposed to be experienced.
Watching so many movies these days where the story simply can't be told without extensive CGI, and given that I personally, and likely many others, am so tired of the current crop of human "celebrities", it's only a matter of time before I can just ask for a full movie about a compelling story in a subject and world of my choosing. Soon, if I want a super awesome spy movie with lots of stealth and action and heists etc., it'll be a full bespoke movie experience.
So with books, as you mention, an author will be able to define the full experience, and AI will allow people to select the stories and movie experiences they want on demand.
>AI will allow people to select the stories and movie experience they want on demand.
Yeah, there's a bunch of movies/shows I'm not interested in watching but am fascinated by the aftershow content, i.e. all the Game of Thrones podcasts/recaps. It's interesting the variety you can get in a 5 minute recap, a 20 minute one, or a 2 hour one. There are a lot of mid movies out there where I'm satisfied with just the recap. Ditto with all the channels summarizing entire seasons of TV shows. I'm just waiting for a selectable X minute recap/summary/synopsis slider for content. Bonus: choosing how the recap is done, i.e. I really like the Game of Thrones wiki because it feels more like a history book with the drama stripped out / toned down.
This is off-the-cuff thinking out loud about a concept I cannot articulate properly, but here's the attempt:
>>I'm just waiting for a selectable X minute recap/summary/synopsis slider for content.
Use the depth of the tools afforded to us at this point, and they are FN deep.
---
Hear me out:
Using AI to craft AI in order to feed ourselves knowledge in the most compact, pragmatic, and efficient manner.
BingeKnowledgeAsASynapse.injector.
--
Many people consume video/podcast/audio/whatever at 2x speed. I frequently run 2x and then just skip back (<--) whenever I need to recap a point, should I meander down all the paths.
I want to craft a funnel of information and be able to tell the AI to teach me THING. I've come up with a nascent framework for this, which is effectively a scaffold (think Bootstrap) for getting robots to do my bidding.
"Give me an FPS that I can launch on UE5 (installed) that will yank the world assets from [MOVIE] and allow me to play as [CHARACTER] and include the mechanics of [STEALTH] [D&D 52 rules] and skin it as [ANIME-THING] - and use the [GAME] as reference - but give it to me as a tower defence webapp""
[0] - First ever gamed on Unreal, AGP, Celeron, etc...
--
Imagine walking into a game test lab and barking "Give me the latest build of X game on Y engine on Z console".
Yes! And books. Importantly, I want it to be a verbatim reading that mentions, e.g. in between sections, that "it's now reading an info box", or describes a picture, etc.
I tried this by getting Claude to write the text (based on the PDF) and then putting that into ElevenLabs for audio (quite expensive), and for what it's worth, that worked quite well.
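Roughly, the pipeline was shaped like the sketch below. The model name and voice ID are placeholders, and the endpoint details are from memory, so check the current Anthropic and ElevenLabs docs before reusing it.

    # Sketch of the PDF -> Claude -> ElevenLabs pipeline described above.
    # Model name and voice ID are placeholders; endpoint details are from
    # memory and may have changed, so verify against the current API docs.
    import requests

    ANTHROPIC_KEY = "..."   # your Anthropic API key
    ELEVEN_KEY = "..."      # your ElevenLabs API key
    VOICE_ID = "..."        # pick a voice ID from the ElevenLabs dashboard

    def narration_script(chapter_text: str) -> str:
        """Ask Claude to turn extracted PDF text into a verbatim narration
        script that announces info boxes and briefly describes figures."""
        resp = requests.post(
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": ANTHROPIC_KEY,
                "anthropic-version": "2023-06-01",
                "content-type": "application/json",
            },
            json={
                "model": "claude-3-5-sonnet-latest",  # assumed model alias
                "max_tokens": 4096,
                "messages": [{
                    "role": "user",
                    "content": "Rewrite this for audio narration. Keep it "
                               "verbatim, but say 'info box:' before sidebars "
                               "and describe any figures briefly:\n\n"
                               + chapter_text,
                }],
            },
            timeout=120,
        )
        return resp.json()["content"][0]["text"]

    def to_audio(script: str, out_path: str) -> None:
        """Send the narration script to ElevenLabs and save the audio bytes."""
        resp = requests.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
            headers={"xi-api-key": ELEVEN_KEY, "content-type": "application/json"},
            json={"text": script},
            timeout=300,
        )
        with open(out_path, "wb") as f:
            f.write(resp.content)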
I have found huge gaps in how digestible different audiobooks are.
It's so painful when there is a book I am really excited to listen to and I find that I just can't. Maybe I just mentally drop out mid-sentence, or the words float past without carrying any meaning, and it can be difficult to pin down why.
One aspect is the performance. An accent or an intonation can cause a nails-on-blackboard type revulsion. A morose, monotone, or bored voice? Ugh. For example, I just can't do any Scott Brick performance (which is pretty annoying given how prolific he is) because he trails off every sentence in this weird pitch change that doesn't match how any normal person speaks. It causes me this distressing mental dissonance because it's so disconnected from the passage being read. I suspect it's a result of him just reading out words blind and not talking like a human trying to communicate specific meanings and feelings.
This is also the problem with robot voices - the difference between a joke, a question, an assertion, a doubt, sarcasm or self-delusion can all be in the intonation. With text you might pause and revoice a passage to catch a meaning but the conveyor belt of audio means the next words are upon you giving no chance for reflection. The place where I can handle a robot voice is when using it with the text in front of me. I quite like it as a pace setter. I still have my own internal voice giving flavour but gain some reading endurance when tired or the material is boring/dry and I keep mentally wandering off.
The other aspect is the style of writing - some writing is delightful to read but awful to listen to. The parameters include everything from sentence structure to vocabulary. A simple example is lists: perfect for quickly imbibing visually, but maddening to hear read out. Same with numbers, or parentheticals, or footnotes, or... so I generally can't listen to anything technical.
The most listenable thing for me is anything in a conversational style. However, even dialogue-dominated works can be torture when they use certain patterns of "he said"/"she said" directions that are effectively punctuation and silent when we read but overwhelming when voiced. A first-person diary style often fits with being spoken without changes. There is probably an AI opportunity to rewrite books to more resemble radio plays that work better as voice performances.
IMHO voicing technical works needs some brand new ideas. Real-life conference talks that just narrate the paper are torture to sit through. Successfully engaging speakers use personality, humour, and anecdotes, structure their talks playfully with setups/reveals, hypotheticals, and pauses for thought, and have all sorts of rhetorical tricks to bring drama to the driest content.
The most painful audiobooks I've found are the ones that are volunteer-narrated. I think it was LibriVox or something like that? Holy hell, it is bad every single time.
But yes, there are some professional narrators where I am baffled as to how they get paid to narrate audiobooks because they just cannot convey the story. It is such a disappointment when, as you mentioned, you are excited about a book and you run into one of those narrators.
I also don't know if the crowdsourced skip list works for podcasts as well as for YouTube ads, since I think a larger proportion of podcasts use dynamic ad insertion - do those ads have consistent lengths? That might mess up the ad tags/flags.
I haven't tried it, but YouTube Music podcasts can redirect to the YouTube interface, which I assume will work on third-party apps like NewPipe with SponsorBlock integration. SponsorBlock is mostly helpful in skipping sponsor reads; third-party apps (or YouTube Premium) block injected ads.
Ultimately, IMO there are too many niche podcasts out there for the SponsorBlock route. Better to have an AI solution that figures out and tags ads + sponsors + filler content to auto-skip by timestamp on the downloaded audio files, which could be variable length due to localized ads - something like the sketch below.
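A toy version of what I mean, using openai-whisper for timestamps; the keyword heuristic is just a stand-in for a proper LLM pass that classifies each segment as ad/sponsor/filler:

    # Toy version: transcribe the downloaded episode locally, flag segments
    # that look like ad reads, and emit (start, end) timestamps for the
    # player to auto-skip. The keyword list is illustrative only.
    import whisper  # openai-whisper

    AD_HINTS = ("this episode is sponsored", "promo code", "use code", "go to")

    def find_ad_segments(audio_path: str):
        model = whisper.load_model("base")
        result = model.transcribe(audio_path)
        skips = []
        for seg in result["segments"]:
            if any(hint in seg["text"].lower() for hint in AD_HINTS):
                skips.append((seg["start"], seg["end"]))
        return skips

    if __name__ == "__main__":
        for start, end in find_ad_segments("episode.mp3"):
            print(f"skip {start:.0f}s -> {end:.0f}s")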
Seems like a lot of podcasters haven't migrated from Google Music / Google Podcasts to YouTube Music podcasts. The ones that have do get the YouTube interface, but I rarely see any get tagged by SponsorBlock. Honestly, that was one of the things I was looking forward to with the Google Podcasts -> YouTube Music podcast clusterfuck.
What are you listening to podcasts on that makes it hard to skip ads?
I'm using Downcast. It's a fairly old app. I think it defaults to +30s +2m -30s buttons which IIRC are settable. It's trivial to skip the ads. I don't need AI to solve that.
Usually I'm listening to podcasts while doing dishes, grocery shopping, at the gym, or on a walk. I don't really want to have to pull out my phone to skip an ad, and the controls on the headphones are too janky. This would definitely be a killer feature for me.
"Siri, skip ahead 30 seconds". Doesn't always work, and... sometimes, it's 60, sometimes it's 2 minutes... If it's a podcast I listen to a lot, I usually know if it's a 30/60/120 second ad spot. Even the voice command to do it isn't always convenient though, I'll definitely grant.
As far as I understand it, ad blocking on podcasts doesn't cost the creators anything the way it does for blogs and newspapers. Podcasts are downloaded, and apps generally don't report back which seconds you listened to and which seconds you skipped. Ad revenue is largely based on the number of downloads in a particular region.
Not sure why this is downvoted. Ad blockers aren't illegal; no one has an obligation to listen to ads without skipping them. This isn't really controversial on HN.
It's a hassle when you're out/active/preoccupied/sweaty/in winter with unreliable touch gloves. Sometimes it's 2 clicks for a short ad, sometimes it's 8+ for a long one. I consume a lot of podcasts and sometimes use a Bluetooth remote with tactile buttons. If only a smartwatch like the Pebble still existed that could have dedicated skip forward/backward buttons. I don't know if you listen to any popular podcasts that are also on YouTube with audience sponsor-blocking (like the WAN Show), especially with the "filler" YouTube gaming-algorithm content skipped. It's a much better experience eliminating thousands of skips from your life.
Good idea. I originally planned to support transcription via the podcast RSS specification, but since not many podcasts implement this field, I have not added that support yet.
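For reference, the field in question is the <podcast:transcript> tag from the Podcasting 2.0 namespace. A minimal sketch of pulling it out of a feed (the namespace URI is from memory, so double-check it against the spec on podcastindex.org):

    # Minimal sketch: find <podcast:transcript> URLs in a podcast feed.
    # The namespace URI below is from memory; confirm it against the
    # Podcasting 2.0 spec before relying on it.
    import urllib.request
    import xml.etree.ElementTree as ET

    PODCAST_NS = "https://podcastindex.org/namespace/1.0"  # assumed URI

    def transcript_urls(feed_url: str):
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.fromstring(resp.read())
        for item in root.iter("item"):
            for t in item.findall(f"{{{PODCAST_NS}}}transcript"):
                # typical attributes: url, type (e.g. text/vtt, application/srt)
                yield t.get("url"), t.get("type")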
Hello, the Google Play version is still in alpha testing. You can join the Google Group through the link on the home page to get access to the alpha version.
>Export your transcripts and subtitles to your notes app.
It would be nice to be able to TTS the translated transcriptions. Insert tags to switch between voices on normal TTS, and eventually move to device-side generated natural voices that match the original cadence.
Yes, I am planning to implement this feature. In the future, there will be more AI-related features, such as AI recommendations, support for more languages, and a more versatile podcast knowledge chatbot.
As a Pocket Casts lifer, I wish you success. Are you aware of Listen Notes, which transcribes at $1 + $0.06 per minute? I like the idea of 50 transcriptions per month, but there are 30-minute podcasts and 3-hour podcasts. Either way, it feels like there's a ~5-year window where AI podcast apps have a good subscription market before device-side transcription/translation/natural voices arrive. Hope you grow big before then.
Agreed, I also think local transcription is the ultimate solution. For now though, whisper.cpp is still too slow on cell phones. It may take some more time to develop.
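When it does get fast enough, the on-device path could be as simple as shelling out to whisper.cpp. A sketch; the binary name, model file, and flags are from memory and vary between whisper.cpp versions:

    # Sketch of local transcription by shelling out to whisper.cpp.
    # Binary name, model path, and flags differ across versions; check
    # the whisper.cpp README for the build you have.
    import subprocess

    def transcribe_locally(audio_wav: str) -> None:
        subprocess.run(
            [
                "./main",                      # newer builds name this whisper-cli
                "-m", "models/ggml-base.bin",  # ggml model downloaded separately
                "-f", audio_wav,               # whisper.cpp expects 16 kHz WAV input
                "-otxt",                       # write a .txt transcript next to the input
            ],
            check=True,
        )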
I haven't followed Listen Notes' transcription feature; I'll check it out. I've tried their catalog and search API, and many popular Chinese podcasts are not included.
Looking forward to the Chinese podcast integration especially. TBH, the PRC podcast ecosystem is just so far ahead. There's nothing close to Ximalaya as an audio platform/economy. Podcasts keep getting bundled into music or video streaming platforms in the West, when they should be combined with audiobooks and other spoken-word content into their own thing.
Yes, I am very impressed with the summary function and the chapter assignments, which are amazingly useful for jumping around to find the best parts to go back and snip. Exporting text to note apps is also huge once the collection gets big enough over time.
Yes, Snipd is very good. But for me: 1. It does not support Chinese transcription. 2. It is more like note-taking software than podcast software. That's why I still developed this app.
One friend wants conference papers turned into audio.
Another wants fan-fiction read dramatically.