I submitted a request for Norwegian Bokmål, and realised a complication which I'...

OfSanguineFire · on Dec 5, 2023

Are you autistic? I ask because this is HN where lots of people are, and choosing to speak the literary norm in countries with diglossia is often associated with autism. For example, foreigners in Finland are urged to quickly get to grips with puhekieli (spoken Finnish) because speaking kirjakieli (the literary norm) in everyday contexts, or writing it in chats, is “something only autistic people do”.

vidarh · on Dec 5, 2023

Not to my knowledge, though I may have some traits.

That said, in Norway the literary form is/was spoken on e.g. TV and radio similar to how RP (received pronunciation) is/was spoken on the BBC, more so (in both cases) in the past than now, when dialects are more broadly tolerated. On top of that, in affluent areas of Western Oslo and adjoining affluent areas the dialect sits mostly within what is "allowed" in Bokmål, and actually mostly towards a more conservative end of the allowed range than where I sit, and it's somewhat political, in that more conservative forms of Bokmål historically tended to be associated with social status (or aspirations...).

It's unusual more in that the pockets and social groups where dialects overlap fully or almost entirely with Bokmål are fairly small.

My spoken dialect is within that spectrum, exacerbated by reading a lot of older literature at an early age that used quite old-fashioned forms of Bokmål, and picking up more formal language than many of my peers spoke through that, but I tend to be closer to the more affluent dialect in writing than in speech.

(EDIT: My spoken dialect would probably fit as a somewhat "posh" version of Urban East Norwegian[1] today, with somewhat more conservative word choices in places where contemporary Urban East Norwegian would have deviated from Bokmål in minor ways in the 70's and 80's by being somewhat more "relaxed" in ways that have since been accepted in subsequent adjustments of the rules)

If you heard me alongside my dad there'd be relatively minor differences between our dialects, and I'd probably sound marginally less formal as I adopted some spoken patterns from the more working class area I grew up in outside Oslo, while he at least when younger would be recognisable as having grown up on the Western edges of Oslo.

Beyond that, language has always fascinated me, and I tended to take a certain level of delight in torturing my Norwegian teacher who favoured the other official language - Nynorsk. Nynorsk and Bokmål overlap very significantly, and more so after recent language reforms which have tended towards allowing more Nynorsk forms of words, or ones closer to them, in Bokmål. Our Norwegian teacher very much wanted us to use those forms (that'd be favouring "sola" over "solen" etc.), and I used to express my distaste for Nynorsk by instead exaggerating my preference for the more conservative Bokmål forms.

[1] https://en.wikipedia.org/wiki/Urban_East_Norwegian

sgt · on Dec 6, 2023

Riksmål is the word you are looking for.

vidarh · on Dec 6, 2023

When I was growing up Riksmål was far more conservative than what I spoke despite the fact that I spoke fairly conservative Bokmål, and it was still somewhat more conservative than how I wrote. I've not paid much attention to Riksmål, but I'm vaguely aware they've moderated themselves quite a bit.

However, a quick check with Det Norske Akademi's dictionary shows that both my spoken and written Norwegian are still not a full match for Riksmål, though I see they've pretty much "surrendered" and even accepted some -a endings, so it's getting close-ish.

Maybe in another couple of decades.

ftyers · on Dec 6, 2023

> If you want to maximize the utility of a dataset like this, you really would want to let each speaker at least assign a lot of tags/labels to their profile; even if you don't want to deal with the hornet nest of trying to figure out all the distinctions, even unstructured labels would be a start, and ideally allowing people to tag individual recordings as well, because there are a lot more variations than just "language" and "accent" here.

This is exactly what the freeform accent (actually "variant") field is. You can add as many tags as you like. https://foundation.mozilla.org/en/blog/how-we-are-making-com...

vidarh · on Dec 6, 2023

Then the guidance on the site really needs to be updated, as that's not what the help in the profile section says, and starting to type the auto-completing options didn't really give reason to suspect that either.

yorwba · on Dec 5, 2023

> Norwegian Bokmål

... is currently in progress. What's missing is a sufficiently complete translation of the UI https://pontoon.mozilla.org/projects/common-voice/ and a sufficiently large number of sentences for people to record https://commonvoice.mozilla.org/nb-NO/write

> let each speaker at least assign a lot of tags/labels to their profile

Common Voice data files have columns for age, gender, accents, variant, locale and segment. (Not sure what that last one is.) These are per recording, but I'm pretty sure they're the same for all recordings by the same speaker.
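As a rough way to check the "same for all recordings by the same speaker" guess, one could group the rows by speaker and look for speakers with more than one distinct profile. This is only a sketch: the sample rows are made up, and using a `client_id` column as the speaker key is an assumption, not something stated above.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the shape described above.
# The "client_id" speaker column is an assumption for illustration.
SAMPLE_TSV = (
    "client_id\tpath\tage\tgender\taccents\tvariant\tlocale\tsegment\n"
    "spk1\ta.mp3\tthirties\tmale\tOslo\t\tnb-NO\t\n"
    "spk1\tb.mp3\tthirties\tmale\tOslo\t\tnb-NO\t\n"
    "spk2\tc.mp3\t\t\tBergen\t\tnb-NO\t\n"
    "spk2\td.mp3\t\t\tStavanger\t\tnb-NO\t\n"
)

# Columns that describe the speaker rather than the individual recording.
PROFILE_COLUMNS = ("age", "gender", "accents", "variant", "locale")


def inconsistent_speakers(tsv_text):
    """Return speaker ids whose profile columns differ between recordings."""
    profiles_by_speaker = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        profile = tuple(row[column] for column in PROFILE_COLUMNS)
        profiles_by_speaker[row["client_id"]].add(profile)
    return sorted(
        speaker
        for speaker, profiles in profiles_by_speaker.items()
        if len(profiles) > 1
    )


print(inconsistent_speakers(SAMPLE_TSV))
```

An empty result on a real file would support the guess; any speaker listed has recordings tagged with differing profile values.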

vidarh · on Dec 6, 2023

Weird to hold off on adding a language because the UI isn't translated. Why would there be an assumption that the language people want to record is linked to preferred UI language?

I don't want Norwegian UI - I just want to be able to record Norwegian sentences. If the UI switches to Norwegian I'd be very annoyed, as I haven't indicated I want that and my browser settings specify English.

(I avoid Norwegian for UIs, because the translations are generally wildly inconsistent in how they translate key terms that I'm used to seeing in English, so it's a massive nuisance - when people assume UI and content language should be the same, that is a major failing to me)

Re: tags, someone else pointed out the accent field is being used for this, even though the UI describes that as specifically for accents.

ftyers · on Dec 6, 2023

[comment removed]

vidarh · on Dec 6, 2023

Frankly this does seem like a massive barrier to me.

It's certainly causing me to lose interest, and I suspect it's driving away a lot of people, not least because it was not at all obvious to me there was some way of speeding up getting a language in the first place.

It was already off-putting not to be given a way to write sentences or record right away.

But now that I know, I have no interest in wasting time contributing to a UI translation I actively don't want to be subjected to, but would happily contribute recordings and sentences on occasion if the language was enabled, because the potential for speech recognition and TTS utility is entirely separate in value from UI.

This whole approach feels really backwards to me, and the really short list of languages no longer surprises me.

EDIT: I see I actually have had it bookmarked a long time, and presumably lost interest once before due to the lack of my language.

EDIT2: As much as the Norwegian UI is already annoying me and I've already spotted at least one spelling mistake in it, and one translation that is "correct" that thoroughly annoys me, I'll see if I can submit some sentences at least.

yorwba · on Dec 6, 2023

> it was not at all obvious to me there was some way of speeding up getting a language in the first place.

Yeah, that's the biggest failing of Common Voice in my opinion. Getting a new language up to speed could be much improved by simply adding a few links to documentation, but even the existing links are broken, which I reported in March 2022... https://github.com/common-voice/common-voice/issues/3637

> I have no interest in wasting time contributing to a UI translation I actively don't want to be subjected to

Translating the UI may still help you get other people to record, even if you don't want to use it yourself.

> I'll see if I can submit some sentences at least

If you want to go faster, there's also a project to extract sentences from Wikipedia etc. in small doses that Mozilla's lawyers and Wikimedia's lawyers have agreed are fair use. I think you'd only need to define how Norwegian Bokmål separates sentences. (E.g. after a period but not if it's a common abbreviation like "etc." in the preceding sentence.) https://github.com/Common-Voice/cv-sentence-extractor
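To make the "period, but not after a common abbreviation" rule concrete, here is a minimal sketch in Python. It is not the cv-sentence-extractor's actual rule format, which I haven't reproduced here; the function name and the short abbreviation list are illustrative assumptions only.

```python
import re

# Illustrative, far from complete list of abbreviations that end in a period.
ABBREVIATIONS = {"bl.a.", "f.eks.", "dvs.", "ca.", "nr.", "etc."}


def split_sentences(text):
    """Split after a period plus whitespace, unless the period ends an abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        candidate = text[start:match.start() + 1]
        tokens = candidate.split()
        last_token = tokens[-1].lower() if tokens else ""
        if last_token in ABBREVIATIONS:
            # The period belongs to an abbreviation, so keep scanning.
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences(
    "Jeg liker frukt, f.eks. epler. Det koster ca. 20 kroner."
):
    print(sentence)
```

The obvious weakness is an abbreviation that really does end a sentence, where this sketch would fail to split, so a real rule set would need more than a lookup list.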

ftyers · on Dec 6, 2023

Target segment was a way of including specific subdatasets. For example the digits dataset, which was just the digits 0-9 and yes/no.

indigo945 · on Dec 5, 2023

This is a great argument.

I particularly agree with your point regarding English - my German accent probably sounds jarring to most native English speakers, but it should still be understood. To add to your argument, I have sometimes tried to turn on subtitles for YouTube videos in some accent of English that I haven't had much contact with (such as Nigerian English), but the auto-generated closed captions turned out to be even more useless than my own comprehension.

However, one should keep in mind that Mozilla's main goal here is accessibility, with the implication that they mean accessibility for blind and deaf people in particular - as opposed to accessibility for stunted multilinguals like us. For these purposes, being able to transcribe mainly mainstream uses of the language is fine, and so is being able to generate speech in a hodge-podge averaged dialect. I highly doubt most blind people care about whether their TTS engine speaks The Queen's English or not, as long as it is clear and understandable.

vidarh · on Dec 5, 2023

What is "clear and understandable" varies greatly, though. E.g. Nigerian English is often subtitled in the UK, but fairly often so is Scottish English... Both often to the great dismay of speakers of the two who sometimes are very annoyed at the expectation that people might not understand them.

Nigerian English is actually fascinating in that there's a whole spectrum from Nigerian Pidgin, which ranges from nearly unintelligible to English speakers, to "mostly British English" in terms of orthography and grammar, but which still tends to incorporate words from several different Nigerian languages and pidgin. (e.g. abeg, don't give me any wahala; Please, don't give me any trouble)

Now consider that Nigeria is about to become the country with the second largest number of English speakers worldwide (it's close to tied with India, depending on which sources and level of proficiency you consider, and Nigeria's population is growing far faster than India's). While it's still quite far behind the UK for people speaking it as their first language, current population growth and increasing use of English (e.g. my ex-wife's first language is English because her parents' first languages were Igbo and Yoruba, and that kind of situation is driving adoption) are likely to cause Nigeria to become the second largest on that measure as well.

So handling a broader range of dialects will matter, at least in terms of recognition - I do agree that there's more flexibility for generation, though even there if you try to feed a broader Nigerian English pidgin to a TTS engine and it doesn't know what to do with the words, it might well end up being unintelligible both to e.g. American or British English speakers and Nigerian English speakers.