Show HN: Offline voice messages transcription in Signal Desktop (a2p.it)
191 points by dexterp37 on June 14, 2022 | 71 comments



For everyone on Android: a lot of Android phones have an auto-subtitling feature now that you can turn on by tapping the volume buttons, and pressing the captions icon. Works great for things like this, apparently it's what they built it for.

https://support.google.com/accessibility/android/answer/9350...


Unfortunately English only...

I use https://t.me/transcriber_bot god knows what they do with my personal stuff though...


The help post I linked states that it supports English, French, German, Italian, Japanese, & Spanish on Pixel 6+.

I'm on a OnePlus and I've not had it show anything but English correctly, so yeah, it's probably not enabled for other OEMs. Maybe they use their special AI hardware for it on Pixel?

It does say though that the detection is completely on device and none of the data is sent to Google, which is nice - maybe Telegram can add a feature like this, with some of the recent open source models that can run on embedded devices.



They use wit.ai for the actual transcription, which belongs to Meta (i.e. Facebook).


So, does that mean the answer is "use it and sell it at will"?


TIL, thanks for sharing! My phone is not supported (Fairphone 3+), sadly.

While such a feature would definitely help with the transcription of a message, unless the transcribed text is stored within the app it would not help with searching the messages.


I use an older version of Android so I don't have this feature.. but you can still work around it by using live transcribe or otter.ai instead, if it plays through the speakers it will be heard (they will not work with headphones connected). These machine-based apps are a blessing because previously, it was necessary to ask someone else to tell you what the message said, whether a relay operator or in person. They're certainly good enough to let you know instantly whether it's spam, a wrong number or something important.


And iOS 16 will also have this feature this fall; it can be linked to a triple-click of the side button.


TIL, never knew what that button is for, thank you!


This is such a neat solution to a problem I am facing every day. Especially in an office environment it's simply not feasible to listen to voice messages. What ends up happening is that they simply get buried under all the "normal" messages and forgotten.


It might not be immediately obvious, but on WhatsApp at least, you can hold your phone to your ear and the voice clip will play as if you're on a phone call. I am making an assumption that Signal would be the same.


I always thought this was an iPhone feature, not strictly WhatsApp. Back when I used WhatsApp I tried this with my Android phone and couldn't get it to work. Definitely possible that I overlooked something crucial though.


This is really cool! It's one of the features I miss in every single chat app. The fact that it's done locally here is a nice privacy bonus.


Slack has that feature. If you get a voice message you can either play it or display the transcript. Although I don't think it does it offline.

Edit: only on paid plans


WeChat gives this by default (but not for all languages, and you cannot change the language of a given voice message). But it’s so slow because it does it in real time, so a 1 minute voice message (the maximum length in WeChat) takes 1 minute to transcribe.

I still don’t understand why the OS doesn’t provide that function in every instant messaging app: just show “transcribe with Siri” on any audio message in any app and be done.

At least as an accessibility setting for those who cannot even listen to those messages.


that's the future, deliver more and more open ai models as if they were just software libraries and go back to native offline capable apps.


Great feature. Good luck with making it production ready and merging it upstream if you intend to do that :)


That's amazing!!!

Is there any possibility it's going to be added to Signal Desktop?


I can appreciate the effort and think it is an amazing idea. However, why build that into Signal which is E2EE and use an API for the speech-to-text?

It would be much more logical to do it with Messenger (for example by injecting JS with Tampermonkey) or to use a local model to perform S2T.


Scribbn is a great app that does speech-to-text for WhatsApp and other messaging apps whenever I can't listen to voice messages, e.g. during meetings - or when I'm just not in the mood for listening to voices.

https://www.scribbn.de/


Amazing! Every chat app needs this. I use WhatsApp daily and this is in dire need.


I would settle for an option to automatically reject voice messages.


You'd miss so much important stuff though! "Uhhh yeah so I wanted to reach out to ask you about, you know how last week we - hold on dude I'm sending a voice message - yeah so uhhh what was I saying..."


I'm going to have to look into this. I've been working with Pocket Sphinx and the accuracy has been terrible (hasn't gotten a single word right).

Anyone know if this approach works OK with background noises?


Texts.com supports offline audio transcription for all platforms. They say it's powered by Apple's speech recognition APIs.


Why is nobody talking about how many situations will arise where one person sends a voice message with a certain tone only to be completely misrepresented by someone using this context-killing feature?


I would guess that if something seems rude or off then they might actually play the voice message.


Hmm.. site blocked for phishing by my corporate fortiguard.


> I get why people send voice messages: it saves their time, allowing them to record their voice while doing other things.

This is true, it does save their time, at the expense of my time. Now I have to spend two minutes listening to your pauses while trying to collect your thoughts, rather than spend 15 seconds reading the final thing. I also need to make sure I'm in an environment where it's socially acceptable to listen (ie when I'm alone, or have headphones in), so it's much more of a burden to the recipient.


If someone doesn't make the simple effort to synthesise their thoughts in a text message or a short audio message, and instead makes a long audio message without much content, I won't be the person who makes the effort to listen to 2-5 minutes of audio.

Fortunately, most of the people I care about don't do that, and it has only happened to me with others. Anyway, I find this a lack of respect for the other person, since it wastes their time.


This is why I can’t get into podcasts. Too much fluff.


This reminded me of when I started listening to a new podcast and also tried a new podcast app at the same time. The podcaster had this unusual editing style where he'd cut all the pauses out of his podcast and I really grew to love it. It was only when I tried a different podcast that did the same thing that I realised I had set the app to cut out all of the silences.

After that I've been unable to go back to normal speed podcasts - I just don't have the time or patience.


What app was that, out of curiosity?


Podcast Addict [0] has that feature but only runs on Android devices. I use that as well as speeding up podcasts to 2x speed which makes things better for me.

[0] https://play.google.com/store/apps/details?id=com.bambuna.po...


The app was Google Podcasts [0].

[0] - https://podcasts.google.com/


A variety of apps can do that. Pocket Casts and Podcast Addict come to mind.


Overcast will do it on iOS, if you're not on Android.


Edited podcasts (as opposed to "two dudes in a room talking") don't have this problem.


That's not the whole story though, some edited podcasts I find hard to listen to _because_ they're (heavily?) edited. Some people-having-a-chat-podcasts work well for me.

I guess I can't stand too much 'energy' in the audio, be it from editing into something fast paced, or overly enthusiastic and dramatic voice (acting). I'm looking for something where content speaks for itself, rather than the voices.


I have avoided podcasts entirely until this year when I found out about Darknet Diaries. It’s fantastic and I binged all 117 episodes on my drives.

On the other hand, I tried others…and they are mostly fluff. Eventually I will probably find another but haven’t come across much so far.


I changed my voicemail several years ago to say "Hi, you've reached NAME. There is no voicemail on this service, so please send me an SMS or Whatsapp message, or email me EMAIL. And have a great day!"

I still get a voicemail notification every few weeks, though I have no idea from who because I have absolved myself of ever listening to them again.

Unlike having no voicemail, I can still provide a personalised message to the person calling me (and as a business owner, I do want people calling me).


I wish most carriers allowed you to record a voice message, but actually have no slots to record so that people can't actually leave a message.


Exactly. It often takes weeks for me to reply to a voice message. Simply because I don't feel like listening to it. Sometimes I don't reply at all and never listen to it. My time is important too you know.


Err, have you asked them not to do that? About the only communication use for my smartwatch I've found is sending people voice snippets, because heaven knows you can't type on those screens. It's not too hard for the recipient to read them, and if you turn whatever you were going to type into your voice snippet there's no issue with length. (This genuinely seems useful for the people who get a cell-enabled smartwatch. Could exercise without anything in your pockets.)

Can also read text messages; can't see pictures very well, though they try. The speech to text is another option but less reliable than just sending your voice. It's gotten pretty good though.


Could just be me, but in some cases I find it easier to explain/understand complex ideas succinctly with a voice note.


Other than when you're trying to replicate sounds (e.g. music), I have a hard time believing that claim. For normal speech, text is essentially a 1:1 recreation. It might be faster for you to record the same number of words than to type them, but that's not the same as "succinct", and it ends up costing time for the other party (as the parent poster mentioned).


> For normal speech, text is essentially a 1:1 recreation.

More of a 1:0.5 or even 1:0.25. A lot is lost in going from speech to text, such as tone and volume, two very important aspects of speech that affect what is said and what the speaker is trying to convey. Not to mention there are certain little things you can do in speech that don't translate losslessly. I don’t disagree that voice notes take up more time than text, but speech and text are far from 1:1.


Can we have device local audio transcription from voice messages to text? I guess you could already do it using some cloud service, at least for popular languages. But I would prefer not to do that for privacy reasons. I'm not so sure about how well my native languages (Finnish & Swedish) would fare.

IIRC android has a voice keyboard. So instead of sending voice messages, dictate and send text. But that would require effort from the sender.


Don't these problems also exist for the sender (more time consuming, socially acceptable environment to send, etc.)? IME I have found texting more time consuming and personally prefer calling people, though that's a bit old hat now.


They do, but the sender-pays model aligns incentives better than both-pay.


I love the ability to have an async conversation this way though.

Take the time to pause and collect your thoughts before sending the message. It'll help you build the same habit when you have in person conversations as well.


I agree with some of the issues but one good feature I like - just listen to the voice note at 2x the speed. And usually people are concise with their thoughts.


Ah yeah, that's a good trick. Maybe I'm biased because I have one friend who usually sends me voice notes where he tries to collect his thoughts, so there's a ton of pausing.


I find this response baffling (as well as some of the replies) and it illustrates to me how people treat their devices differently.

I see my phone as a way to connect with people rather than a way to simply communicate with people. The fact that I might have to listen to people pause and form their thoughts is a positive of voice notes, not a negative. I don't ask my friends to provide transcripts before I meet with them in-person to make sure my time isn't wasted when we speak.

I understand that if you are receiving work-related voice notes that are better suited for text then that would be frustrating, but hearing my friends speak is so much more preferable to just reading their words on the page.


To each their own, I guess. I usually chat with multiple friends and do other things at the same time, so to have to stop everything to listen to a voice note of someone pausing and trying to figure out what to say feels like they're intentionally making me wait for their convenience.


> they're intentionally making me wait for their convenience

Lol, it's definitely possible to do this with text too "typing..."


Not familiar with Signal, does it require you to hold something down as it plays?


No, but if you leave the screen to read another message it stops (plus my brain can't process more speech if it's already listening, so I have to stop all speech-related tasks until your note is done).


I agree with the sentiment that hearing someone's voice creates a connection. However, I do not understand why people use voice messages instead of calling so you can actually talk to each other.

Why is that?


For me it's the asynchronous nature. You can essentially have a conversation throughout the day (or over the course of a few days) without the pressure of having to organise a call at a particular time that works for both.


Asynchronous communication is easier for busy people, typically


Personally, I think the idea that asynchronous communication is more efficient is a fallacy. In a verbal, synchronous conversation (a phone call), you can communicate extremely efficiently. Misunderstandings or doubts are detected and clarified instantly. On the other hand, finding a moment to listen to a voice message, listening to it, responding later etc. seems to take _much_ more time in total.

This is of course highly subjective. You might prefer the ability to choose the most suitable moments for communicating over communicating efficiently. My impression is, however, that the asynchronous type of conversation just stresses people out because the initial illusion of "just a quick message" turns into an endless back and forth of clarifications. I cannot imagine a meaningful communication on any level that does not require any kind of clarifying question, queries etc.; be it even just for expressing empathy or confirmation.

I wonder if there is any linguistic research on verbal asynchronous communication to make the matter less subjective.


> asynchronous communication is more efficient is a fallacy

Nobody said it is efficient. It's good for certain situations. People are doing various things throughout the day and they don't always have time to hop on a call.

But of course, when both parties are completely free and asynchronous communication hits a bottleneck, you can easily switch to synchronous.


> In a verbal, synchronous conversation (a phone call), you can communicate extremely efficiently. Misunderstandings or doubts are detected and clarified instantly.

My anxious brain might just work differently here, but I’ve found that I communicate way more efficiently if I have a minute or two to turn something over in my head before I need to respond.

Even if I don’t grok it immediately, my clarifying questions will usually be much better because my mind has had time to work through the possible second-order effects of whatever we’re discussing without - and this is key - the social pressure of making someone sit and wait while you think.


True.

I didn't answer three phone calls and ignored like 10 messages from a co-worker from a different department since yesterday afternoon.

This afternoon I finally received an email from him where he told me that he'll do exactly what I suggested in a Jira ticket comment two days ago, and that my further involvement is not needed.

I guess if I answered any of these communication attempts, we would still be discussing the solution to a problem that is already known for more than two days now.


I saw a stat from Meta saying that 7B audio messages are sent daily via WhatsApp. Is this big internationally? How is this used? Naive me thinks it's how old people use chat.


I believe it's region-specific. In Italy, in my experience, most teens communicate through voice messages (citation needed :-) ).


> I get why people send voice messages: it saves their time, allowing them to record their voice while doing other things.

Is that the case? I know in Signal on Android, the voice typing features of keyboards are disabled as a security feature because these leak your message to Google ASR APIs etc. I use voice typing to save time and sore thumbs tapping on a touch screen. I can see how when that's disabled people might just record a message and send the audio instead. Maybe the solution here is to avoid the voice message altogether and enable offline voice typing in the first place.


While I love the project linked here, "voice" messages allow you to do more than talk. I've sent messages where I'm singing, or also where I'm reading a book chapter. It'd be sad if those things weren't possible anymore.


Just checked, and Google-based voice typing seems to work fine in Signal for me (though I don't use it).



