Translating in chunks
  • I've been an avid TED watcher online, but I'm only a newbie at translating. Here's my problem: sentence structure, or simply, grammar.

    I'm translating talks into Chinese; however, sentences said in English sometimes come out backward when they are translated not word-for-word but "chunk-by-chunk". [It's say like talk I this.] Yes... very confusing.

    That's not a problem in and of itself, except that the dotsub platform doesn't give me, the translator, the freedom to toy with the timing of the text. Although I don't have to do the tedious work of synchronizing the text with the video, it also means that, regardless of the language being translated into, the translation is stuck with the original language's timing of words. Chinese sentences don't take a lot of room on screen, so if only I could have the entire sentence showing for the duration of the sentence being spoken, problem solved. Voila.

    Is there a solution for this at all? For example, could I contact someone with the privilege of tinkering with these "deeper" functions? Or am I blind and I just missed something here?

    Also, and this is only marginally related: how about a preview feature, so we can see how well it works 'in action'? Thanks.

    P.S. Sorry, I should have browsed the forum before posting this. Here's the other thread, which discusses just about the same issue:
  • "Chinese sentences don't take a lot of room on screen, so if only I could have the entire sentence showing for the duration of the sentence being spoken, problem solved. Voila."

    MrMen, you can try copying the same Chinese sentence into several consecutive lines. In effect, it will do exactly what you want: keep the sentence on the screen for the duration of the sentence being spoken.
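
    A minimal sketch of this workaround, assuming each subtitle line is represented as a hypothetical `(start_seconds, end_seconds, text)` tuple. That representation and the function name `repeat_across_cues` are illustrative only, not part of any dotsub tool. The idea is simply to put the same translated sentence into every consecutive line that the spoken sentence covers:

```python
from typing import List, Tuple

# Hypothetical cue representation: (start_seconds, end_seconds, text).
Cue = Tuple[float, float, str]


def repeat_across_cues(cues: List[Cue], first: int, last: int, translation: str) -> List[Cue]:
    """Return a copy of `cues` in which lines first..last (inclusive)
    all carry the same translated sentence; timings are left untouched."""
    return [
        (start, end, translation if first <= i <= last else text)
        for i, (start, end, text) in enumerate(cues)
    ]


# The English timing is fixed; only the text of each line can be changed.
english = [
    (0.0, 1.5, "I will do it again,"),
    (1.5, 2.5, "next year."),
]

# The whole Chinese sentence is shown for both lines.
chinese = repeat_across_cues(english, 0, 1, "明年我会再做一次。")
```

    Because the timings are copied unchanged, the result still lines up with the original video, which is all the platform allows the translator to control.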
  • About copying lines - the lines will show with no noticeable gap in the player (so it will seem it's just one subtitle displayed for a longer time), but there may be noticeable "flicker" in other players when the subtitle changes (even when the text is the same).

    Also, note that sometimes it is important to tie the subtitle to what is happening on the screen. E.g. sometimes you don't want to reveal some piece of information too soon (e.g. when it is important rhetorically). Or maybe when the speaker says a word, they point to something - then you should ideally wait till they do that with the subtitle, not just use one long sentence that will display ahead of time.

    Note, however, that you don't necessarily need to follow the English sentence. If you translate by sentence or by clause, you can often use the wording that works best in your language. For example, let's say that the English subtitles are:

    I will do it again,
    next year.

    If your language does not allow the time adverbial (next year) in that position, your subtitles could be:

    Next year,
    I will do it again.

    This is often possible if:

    * the original spoken line does not contain a word that would be recognizable to the audience in your language even without knowing the original language (e.g. a word like Coca Cola or a proper name like John Smith). Otherwise, the (hearing) audience will recognize the item in the original spoken line and will be confused as to why it was not included in the subtitle.

    * the original spoken line is not tied with anything visual on the screen in a meaningful way. For example, if the speaker showed a slide with a year number while saying "next year", you may decide that it's better to rephrase the sentence to include some kind of reference to "year" in the second subtitle.

    The second rule is more important.
  • But watch out for timing gaps. Make sure that the sentence is shown continuously, without a short gap. I had this effect in a Dutch translation and it made it a bit annoying to read. I solved this by cutting up sentences.
  • Copying chunks is a workaround and has another detrimental effect of breaking the semantics of the format. Suppose some day some clever machine translation scientist would like to use high-quality TED subtitles as a parallel source. Copying subtitles would make his life really hard. I would not do it. The proper solution should be developed, not a workaround.
  • I agree, and this will not be necessary in the future. However, as long as the scientist is clever enough, she can add a rule that merges consecutive subtitles containing the same line into a single subtitle spanning all of them.

    For example:
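
    A minimal sketch of such a rule, assuming subtitles are available as hypothetical `(start_seconds, end_seconds, text)` tuples. The function name `merge_repeated_cues` and the `max_gap` tolerance are assumptions made for illustration, not part of any existing tool:

```python
from typing import List, Tuple

# Hypothetical cue representation: (start_seconds, end_seconds, text).
Cue = Tuple[float, float, str]


def merge_repeated_cues(cues: List[Cue], max_gap: float = 0.1) -> List[Cue]:
    """Collapse runs of consecutive cues with identical text into one cue.

    Two cues are merged only if the second starts no more than `max_gap`
    seconds after the first ends, so a line that is genuinely repeated
    much later in the talk stays a separate subtitle.
    """
    merged: List[Cue] = []
    for start, end, text in cues:
        if merged:
            prev_start, prev_end, prev_text = merged[-1]
            if text == prev_text and start - prev_end <= max_gap:
                # Extend the previous cue instead of adding a duplicate.
                merged[-1] = (prev_start, max(prev_end, end), text)
                continue
        merged.append((start, end, text))
    return merged


cues = [
    (0.0, 1.5, "A"),
    (1.5, 2.5, "A"),    # same text, back to back: merged with the line above
    (2.5, 4.0, "B"),
    (10.0, 11.0, "B"),  # same text, but much later: kept separate
]
result = merge_repeated_cues(cues)
```

    The gap check is a design choice: without it, a phrase the speaker really does say twice, minutes apart, would be folded into one very long subtitle.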






  • You are right, it is quite easy to handle.

    The point is that the *scientist has to be aware* of this kind of pattern in a dataset. Such things are often not documented anywhere, so more often than not they are not discovered, or, if discovered, it is often too late. This happens regularly when the semantics of a data format are broken.
  • Well that is possible, but I can't imagine anybody working with corpora without bothering to learn about the sources.
  • And again you're right :) And judging by the translation process, the subtitles seem to be of pretty high quality and trustworthy - reviews, etc. You might even skim through some subtitles in languages you know. But how lucky would you have to be to notice such repetitions? :)

    And is there any reference to this thread in a more prominent place than the forum? Another thing is that quite often metadata and data travel separately. For example, how easy is it to go from dotSub (data, one place where the dataset of subtitles can be accessed) to this thread (metadata, a description of the dataset)? Any reference? Can you estimate the chances of a researcher downloading this dataset and finding out about this trick? :) Pretty low, I should say.

    Again: this breaks the semantics of the format, and until this trick is mentioned in the format or dataset description it remains an obscure workaround. However, sometimes workarounds do become a "de facto standard" :)
  • I am sure a researcher would bother to read a few translations (if not more!!) and they would come across it. Also, even in the simplest tools that I use (e.g. ApSIC Xbench), there are tests in place to find multiple translation units that have a single translation. I don't think that doubling subtitles would be a problem if the researcher exercises at least the minimum necessary care ;)

    Also, I understand where you're coming from - I discussed this research-related drawback of repeating lines previously in the I Translate TED Talks group. But this is just a potential drawback and the solution is obvious :)
  • Oh, didn't know about that discussion. Can you point me to it?

    If that was discussed and there is an obvious solution, then I would consider the trick widespread enough and on its way to becoming a "de facto standard" ;)

    The need to exercise minimum care raises the bar and limits the potential use of the dataset. But well, research use seems not to be the primary goal here.
  • The solution is what I proposed above :)

    Do you really mean that the need to exercise minimum care raises the bar? :D I am not sure that any research without minimum care would be worth anything ;)
  • Minimum care in this, plus another minimum care in that, and the series does not converge anymore. And you have a calculus of care ;)
  • Wonderful information, people! Thank you!!