Wordfast Server user manual

Appendix 5:	The Wordfast Server TM format

WFS uses the Wordfast Classic (WFC) TM format. A WFC translation memory is a tab-delimited text file. It is the simplest of all formats: it can be opened with text editors such as Notepad, with Unicode-compliant word processors, and with Excel. WFC TMs can be regular ANSI (8-bit) text or Unicode UTF-16 (either little-endian or big-endian).
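As an illustration of the two encodings mentioned above, here is a minimal Python sketch that decodes the raw bytes of a TM. It assumes that UTF-16 TMs begin with a byte order mark and that ANSI TMs use the Windows-1252 code page; neither assumption is stated in this manual, and the function name decode_tm_bytes is hypothetical.

```python
import codecs

def decode_tm_bytes(raw: bytes, ansi_codec: str = "cp1252") -> str:
    """Decode the raw bytes of a TM file into text."""
    # Assumption: a UTF-16 TM starts with a byte order mark (BOM).
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM, picks the byte order and strips it.
        return raw.decode("utf-16")
    # No BOM: treat the file as 8-bit ANSI text (the code page is a guess).
    return raw.decode(ansi_codec)
```

To read a file, pass its content as bytes, for example decode_tm_bytes(open(path, "rb").read()).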

A Translation Memory (TM) is a set of lines (paragraphs) of text. In a pure text file where the display does not wrap, lines are paragraphs. The very first line is a header, and all other lines are Translation Units (TUs), sometimes called "entries". Lines/Entries/TUs are sets of fields, a field being any text (even lack of text, which denotes an empty field) followed by a tabulator. In other words, the WFC TM format is tab-delimited text, which is arguably one of the oldest, most robust, most open and easiest to manipulate data formats ever. In the header (the very first line in a TM), each field begins with a % (per cent) mark.

Fields making up a TU:

Field	Example	Format	Remark
Date	20041231~165410	yyyymmdd~hhmmss - the example here means 31 December 2004, at 16:54:10, local time. See note on the tilde ~ character further below.	Optional field: can be empty
User ID (Attribute #1)	YAC	Initials of the TU's creator.	Optional field: can be empty
Counter	5	A number between 0 and 9999 that records how many times this TU was proposed as a 100% match and accepted, that is, re-used as is.	Optional field: can be empty
Source language	EN-US	TMX-compliant language code (but case-insensitive with WFC). It is made of a two-letter ISO language code and, optionally, a dash followed by a two-letter local variant.	Optional field: can be empty. Rule: field cannot be longer than 5 characters.
Source segment	Red Riding Hood was walking in the woods.	The source segment. Maximum size: 8000 Unicode characters.	Should contain at least one printable character.
Target language	FR-FR	Language code, TMX-compliant	Optional field: can be empty. Rule: field cannot be longer than 5 characters.
Target segment	Le Petit Chaperon Rouge se promenait dans les bois.	The target segment. Maximum size: 8000 Unicode characters.	Optional field: can be empty
Attribute #2 (optional)	EL	A mnemonic (maximum length=64 characters; no space allowed) for user-defined attributes. See WFC's "Sample" attributes for typical values, for example, client, domain, job number, department, etc.	Optional field: can be empty, and its tabulator can be omitted
Attribute #3 (optional)	PS		Optional field: can be empty, and its tabulator can be omitted
Attribute #4 (optional)			Optional field: can be empty, and its tabulator can be omitted
Attribute #5 (optional)			Optional field: can be empty, and its tabulator can be omitted

Here are the first two paragraphs (the TM's header and first Translation Unit) of a TM where the TU is defined as in the table above. Paragraphs are long, so they may wrap in your display - but there are only two paragraphs:

%20041231~160445	%YAC, Yves A. Champollion	%TU=00000000	%EN-US	%WFC TM v5.0	%FR-FR	%87412764

20041231~165410	YAC	5	EN-US	Red Riding Hood was walking in the woods.	FR-FR	Le Petit Chaperon Rouge se promenait dans les bois.	EL	PS
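The field layout described in the table can be sketched in a few lines of Python. This is only an illustration, not WFC's own parser: the function name parse_tu and the dictionary keys are hypothetical, and the sample line is rebuilt from the table's example values with real tabulators between fields.

```python
FIELD_NAMES = [
    "date", "user_id", "counter", "source_lang", "source",
    "target_lang", "target",
]

def parse_tu(line: str) -> dict:
    """Split one TU line into named fields; attributes #2 to #5 go into a list."""
    fields = line.rstrip("\r\n").split("\t")
    # Pad short lines so that the seven core keys are always present.
    fields += [""] * (len(FIELD_NAMES) - len(fields))
    tu = dict(zip(FIELD_NAMES, fields))
    # Whatever follows the target segment is treated as optional attributes.
    tu["attributes"] = fields[len(FIELD_NAMES):]
    return tu

line = ("20041231~165410\tYAC\t5\tEN-US\t"
        "Red Riding Hood was walking in the woods.\tFR-FR\t"
        "Le Petit Chaperon Rouge se promenait dans les bois.\tEL\tPS")
tu = parse_tu(line)
print(tu["source_lang"], tu["target_lang"], tu["attributes"])
```

The sketch performs no validation; it only shows how the position of each tabulator determines the meaning of a field.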

When reading a TU, WFC errs on the side of optimism if the TU does not look correct or canonical. When, in a TU:

the date is missing: if WFC is executing a loop that parses TUs, it takes the previous TU's date and adds one second; otherwise, it takes the local machine's current date and time;
the user ID is empty: WFC assumes the TM header's user ID. If that is missing, WFC uses the user's identity as defined in Ms-Word. If that is also missing, WFC uses XX;
a language code is missing, or incorrect but shorter than 6 characters: WFC uses the corresponding language code from the TM's header (the first line of the TM).
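The fallback rules above can be sketched as follows. This is a hedged illustration, not WFC's code: the function fill_defaults, its parameters and the dictionary keys are hypothetical, and the test for an "incorrect" language code (two letters, optionally followed by a dash and two letters) is an assumption derived from the field table.

```python
import re
from datetime import datetime, timedelta

DATE_FORMAT = "%Y%m%d~%H%M%S"
# Assumption: a "correct" code is two letters, optionally a dash and two letters.
LANG_CODE = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")

def fill_defaults(tu, header, previous_date=None, word_user="", now=None):
    """Return a copy of tu with missing fields filled in from the fallbacks."""
    filled = dict(tu)
    if not filled.get("date"):
        if previous_date is not None:
            # Inside a parsing loop: previous TU's date plus one second.
            stamp = previous_date + timedelta(seconds=1)
        else:
            stamp = now or datetime.now()
        filled["date"] = stamp.strftime(DATE_FORMAT)
    if not filled.get("user_id"):
        # TM header first, then the Ms-Word user identity, then XX.
        filled["user_id"] = header.get("user_id") or word_user or "XX"
    for key in ("source_lang", "target_lang"):
        code = filled.get(key, "")
        # Codes of 6 characters or more are left alone: they signal a faulty TU.
        if len(code) < 6 and not LANG_CODE.fullmatch(code):
            filled[key] = header[key]
    return filled

header = {"user_id": "YAC", "source_lang": "EN-US", "target_lang": "FR-FR"}
tu = {"date": "", "user_id": "", "source_lang": "", "target_lang": "F"}
print(fill_defaults(tu, header, previous_date=datetime(2004, 12, 31, 16, 54, 10)))
```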

Fault detection

A faulty line or TU is detected by counting the tabulators in a line of text. A line of text with fewer than 6 tabulators cannot form a valid TU. Another fault-detection method used by Wordfast is that language codes should not be longer than 5 characters. When a language code of more than 5 characters is encountered during a TM reorganisation, it indicates that something is amiss with that particular TU, and the TU is assumed to be faulty. Most Wordfast programs do not halt on faulty TUs; they simply ignore them.
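The two checks described above (tabulator count and language code length) can be sketched in Python as follows. The function name is_faulty is hypothetical, and the sketch covers only these two checks, not everything a Wordfast program may verify.

```python
def is_faulty(line: str) -> bool:
    """Apply the two fault checks: tabulator count and language code length."""
    text = line.rstrip("\r\n")
    # Fewer than 6 tabulators cannot form a valid TU.
    if text.count("\t") < 6:
        return True
    fields = text.split("\t")
    source_lang, target_lang = fields[3], fields[5]
    # A language code longer than 5 characters marks the TU as faulty.
    return len(source_lang) > 5 or len(target_lang) > 5

good = "20041231~165410\tYAC\t5\tEN-US\tHello.\tFR-FR\tBonjour.\t"
lines = [good, "EN-US\tHello.\tFR-FR\tBonjour.", good.replace("FR-FR", "FRENCH")]
# Faulty lines are skipped rather than stopping the process.
valid = [l for l in lines if not is_faulty(l)]
print(len(valid))
```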

Remarks:

The date does not necessarily have a tilde (~) separating date and time. Any printable character can be used there, except a digit. WFC uses the tilde (~) and the equal (=) sign. The equal sign means the TU was "marked" (flagged) by WFC's data editor. This has no consequence at all on the TU's status: it remains fully valid. Although WFC always records the date and time when writing a TU, the date and time are optional and could be empty (or even made of an invalid date), in which case WFC would simply assume the current computer's date and time. All dates and times are "local", taken from the local computer's clock.
If any optional field is left empty, its trailing tabulator should be present. For a TU to be valid, there must be at least six tabulators, with the fifth field (the source segment, located between the fourth and the fifth tabulator) made of at least one printable character.
The date's first character (a digit from 0 to 9, usually a 2 if the TU was created in the current millennium) can be an "x". This means that the TU is no longer valid. The first full reorganisation of the TM by WFC will erase this TU. Do not remove the "x", or replace it with a digit, unless you know what you are doing.
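The remarks on the date field can be illustrated with a small Python sketch. It is an assumption-laden reading of the rules above, not WFC's implementation: the names TuStamp and parse_stamp are hypothetical, and the sketch assumes the separator always sits at the ninth character.

```python
from datetime import datetime
from typing import NamedTuple, Optional

class TuStamp(NamedTuple):
    when: Optional[datetime]  # None when the date is empty or invalid
    marked: bool              # True when the separator is "="
    deleted: bool             # True when the first character is "x"

def parse_stamp(field: str) -> TuStamp:
    """Interpret a TU date field such as 20041231~165410."""
    deleted = field[:1] == "x"
    marked = len(field) > 8 and field[8] == "="
    when = None
    digits = field[:8] + field[9:]
    # The separator may be any character except a digit.
    if len(field) == 15 and digits.isdigit() and not field[8].isdigit():
        try:
            when = datetime(int(field[0:4]), int(field[4:6]), int(field[6:8]),
                            int(field[9:11]), int(field[11:13]), int(field[13:15]))
        except ValueError:
            when = None  # for example month 13: the caller falls back to "now"
    return TuStamp(when, marked, deleted)

print(parse_stamp("20041231=165410"))
```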

Placeholders as tags

Placeholders are used to encapsulate a few special characters, or tags. A WFC placeholder always has the format &tX; where X can take various values, for example &t=; &tA; &t1; &t#; and so on.

&t1;	a placeholder for a Word graphic;
&t2;	a placeholder for a Word footnote/endnote;
&t9;	a tabulator mark;
&t#;	a manual line feed;
&tA; &tB; &tC; ... &t¥;	constitute 100 placeholders for tags;
&t=<some tag="here">;	records an "unknown" tag. Unknown tags are tags found only in a target segment, not in the matching source segment. Semicolons in the tag are escaped with a backward slash \.
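To make the placeholder syntax concrete, here is a short Python sketch that lists the placeholders found in a segment. The regular expression and the function name list_placeholders are illustrative assumptions based on the list above, not a specification of how WFC scans segments.

```python
import re

# &t=...; carries a full tag (inner semicolons escaped as \;); &tX; carries one character.
PLACEHOLDER = re.compile(r"&t(?:=(?P<raw>(?:\\;|[^;])*)|(?P<code>[^=;]));")

SPECIAL = {
    "1": "graphic",
    "2": "footnote/endnote",
    "9": "tabulator",
    "#": "manual line feed",
}

def list_placeholders(segment: str) -> list:
    """Return (kind, value) pairs for every placeholder in a segment."""
    found = []
    for m in PLACEHOLDER.finditer(segment):
        if m.group("raw") is not None:
            # Undo the backslash escaping to recover the original tag.
            found.append(("unknown tag", m.group("raw").replace("\\;", ";")))
        else:
            code = m.group("code")
            found.append((SPECIAL.get(code, "tag"), code))
    return found

print(list_placeholders("Col 1&t9;Col 2&t#;"))
```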

Note to engineers

The ampersand, quote, greater-than and less-than characters (& " > <) are not escaped. The WFC TM format is not a member of the SGML/HTML/XML family.

Limitation: A WFC TM would create a slightly fuzzy match with text containing &tX; as literal text, as in this very paragraph. That is a minor and non-lethal limitation, which, to our knowledge, has not happened in a decade.

Tags in a WFC TM

When dealing with so-called tagged documents, a WFC TM records placeholders for tags. Those placeholders have a &tX; format, where X is the order of appearance of tags in the source segment. The X order is noted A (ANSI decimal 65), B, C, etc., up to ANSI decimal code 165. Thus, there can be no more than 100 tags in a WFC segment.

For example, the following tagged source segment:

would appear, in a WFC TM as:

&tA;This is some text.&tB;

At translation time, when WFC pulls a TU from the TM and is about to propose the TU's target segment as a translation candidate, WFC uses a substitution algorithm to dress the proposed target segment with the full "real" tags, taken from the document's (not the TM's) source segment, using a triangulation method:

Document's source segment <-> TM's source segment <-> TM's target segment

The triangulation can be successful only if all target tags have a "parent" tag in the source segment. This is because, at translation time ("leverage" time), only the new source segment is available, and the target has to be worked out by the machine. In other words, it is not a problem if the TM's source segment contains tags that do not appear in the TM's target segment. The reverse is a problem, however. If the TM's target segment has tags that do not appear in the TM's source segment (orphaned tags), WFC records the full syntax of these orphaned tags at TU creation time, so that they can be restored properly at translation time, when the target segment must be proposed with the correct format. If we have, at TU creation time:

In source segment:	<FONT FACE="Arial">This is some text:
In target segment:	<FONT FACE="Arial">Voici du texte&nbsp;:

then the target segment would be recorded in the TM as:

&tA;Voici du texte&t=&nbsp\;;:

where &t= opens the original tag syntax (&nbsp; in our example, with its own semicolon escaped as \;) and ; (semicolon) closes the sequence.

Other examples of segments:

In source segment:	<FT>This is some text<AR> here<FT>.
In target segment:	<AR>Voici du texte<FT> ici.
In TM TU source:	&tA;This is some text&tB; here&tA;.
In TM TU target:	&tB;Voici du texte&tA; ici.

In source segment:	<FT>This is some text<AR> here.
In target segment:	<AR>Voici du<AR> texte<X;X> ici<FT>.
In TM TU source:	&tA;This is some text&tB; here.
In TM TU target:	&tB;Voici du&tB; texte&t=<X\;X>; ici&tA;.
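The examples above can be reproduced with a short Python sketch of the encoding step. It rests on several assumptions that are not part of this manual: tags are recognised with a simple regular expression (angle-bracket tags and named entities), placeholder letters are taken as character codes 65 upwards, and the names TAG and encode_pair are hypothetical.

```python
import re

# Assumption: a "tag" is an angle-bracket tag or a named entity such as &nbsp;
TAG = re.compile(r"<[^<>]+>|&[A-Za-z]+;")
FIRST_CODE, LAST_CODE = 65, 165  # A ... up to code 165, as described above

def encode_pair(source: str, target: str):
    """Replace tags with placeholders in a source/target pair."""
    letters = {}

    def source_repl(m):
        tag = m.group(0)
        if tag not in letters:
            code = FIRST_CODE + len(letters)
            if code > LAST_CODE:
                raise ValueError("too many tags in one segment")
            letters[tag] = chr(code)  # letters follow order of appearance
        return "&t" + letters[tag] + ";"

    def target_repl(m):
        tag = m.group(0)
        if tag in letters:
            return "&t" + letters[tag] + ";"
        # Orphaned tag: record its full syntax, escaping semicolons.
        return "&t=" + tag.replace(";", "\\;") + ";"

    tm_source = TAG.sub(source_repl, source)
    tm_target = TAG.sub(target_repl, target)
    return tm_source, tm_target

print(encode_pair("<FT>This is some text<AR> here.",
                  "<AR>Voici du<AR> texte<X;X> ici<FT>."))
```

The source segment must be encoded first, because the letters assigned there decide which target tags are known and which are orphaned.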

In most translation memory systems, TMs are bloated with tags that do not belong there. Engineers may have overlooked that a TM takes significance and value when its content is put to use, meaning, when its past translations are leveraged for a new translation project. The point here is, leveraging TM content is done in the presence of a new document to be translated. At that point, the program can operate a triangulation between a new document's source segment, which contains the new formatting, and an existing TM source/target pair, which contains previous formatting placeholders.
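The triangulation can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not WFC's substitution algorithm: tags are assumed to be angle-bracket tags, letters are assumed to follow the order of first appearance in the new document's source segment, and the name restore_target is hypothetical. The TM's source segment does not appear in the code; its role is to be matched against the new source segment before this step.

```python
import re

TAG = re.compile(r"<[^<>]+>")  # assumption: tags are angle-bracket tags
PLACEHOLDER = re.compile(r"&t(?:=(?P<raw>(?:\\;|[^;])*)|(?P<code>[^=;]));")

def restore_target(doc_source: str, tm_target: str) -> str:
    """Dress a TM target segment with the real tags of a new source segment."""
    real_tags = {}
    for tag in TAG.findall(doc_source):
        if tag not in real_tags.values():
            # First distinct tag becomes A, the second B, and so on.
            real_tags[chr(65 + len(real_tags))] = tag

    def repl(m):
        if m.group("raw") is not None:
            # Orphaned tag: its full syntax was stored in the TM.
            return m.group("raw").replace("\\;", ";")
        # Placeholders without a matching tag are left untouched.
        return real_tags.get(m.group("code"), m.group(0))

    return PLACEHOLDER.sub(repl, tm_target)

print(restore_target("<B>This is some text<I> here.",
                     "&tB;Voici du&tB; texte&t=<X\\;X>; ici&tA;."))
# prints <I>Voici du<I> texte<X;X> ici<B>.
```

In this example the new document uses <B> and <I> where the original document used <FT> and <AR>; the TM never needed to store either set of tags.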

To make things worse for formats that are obsessed with recording full formatting information, the XML layer on top of the original format's layer (such as TMX recording RTF or richly formatted text) creates verbosity that borders on silliness. Wordfast opted for an agile format with a footprint 10 to 15 times smaller than TMX.

___________________________________________________________________________

Ms-Office, Word, Excel and PowerPoint are registered trademarks of Microsoft Corp.

All other trademarks belong to their owners.

Wordfast Server is a product and registered trademark of:

Wordfast LLC, 2711 Centerville Road, Suite 400, Wilmington, DE 19808, USA

Contact: info@wordfast.com