Can you translate PDF files with your workflow?

Yes, but it requires additional steps, time, and cost.

The AI pipeline thrives on clean, editable, well-formatted text. Converting source files from one format to another falls outside the scope of the basic service. There is currently no automated method to translate a PDF without compromising output quality; this applies both to opening the PDF directly in a CAT tool and to the more common practice of saving the PDF as a Word file. As a result, additional work is required at four stages: preparation, MT processing inside the CAT tool, processing of the post-edited translation inside the CAT tool, and final review. Depending on their condition, the original PDFs may or may not still be viable for a high-quality MT workflow.

For this reason, you should always push your client hard to provide native, non-PDF source files. Remind them that doing so will save them significant amounts of money, deliver a better final product, and reduce turnaround times.

However, if native files are impossible to get and we are stuck with PDFs, here is how the workflow goes.

1. Before starting (your responsibility)

You convert the PDFs to Word before sending them to me. Simply saving to Word will result in formatting anomalies proportional to the complexity of the source. Even under ideal circumstances, this raw Word file will be riddled with strange formatting, broken segmentation, and unnecessary inline tags. If the PDF was a scanned document, it will also have OCR errors. These issues ruin CAT tool leverage, leading to higher post-editing word counts and internal inconsistencies.

The more you clean up this Word file in advance, the less you will pay for MT processing and post-editing later. This is especially critical for multi-language projects, where unfixed source problems multiply across every target language. Because preparing Word files from PDFs is not included in my services, you must arrange for a file conversion or DTP service to process the PDFs into the cleanest Word files you can get before handing them off to me.

2. Preparing the MT (my responsibility)

No matter how good your conversion service is, the files I receive will still have segmentation and tag problems. I run an automated process to strip most excess tags, then go through the document segment by segment, manually splitting and joining broken segments so the post-editor receives clean, readable text. Sometimes this involves the time-consuming task of joining non-contiguous segments, which then have to be separated again after post-editing.
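To give a feel for what a broken segment looks like, here is a minimal sketch of a heuristic that flags likely candidates for joining. It is an illustration only, not the actual process (the joining described above is done manually inside the CAT tool). The function name `flag_broken_segments` and the punctuation rule are assumptions made for this example.

```python
def flag_broken_segments(segments):
    """Return indexes i where segment i looks cut off and continues in segment i + 1.

    Hypothetical heuristic: a segment that does not end in sentence-final
    punctuation, followed by one that starts with a lowercase letter, was
    probably split by a hard line break in the converted Word file.
    """
    terminal = (".", "!", "?", ":", ";")
    flagged = []
    for i in range(len(segments) - 1):
        current = segments[i].rstrip()
        following = segments[i + 1].lstrip()
        if not current or not following:
            continue  # empty segments carry no evidence either way
        ends_open = not current.endswith(terminal)
        continues = following[0].islower()
        if ends_open and continues:
            flagged.append(i)
    return flagged


segments = [
    "Remove the cover and disconnect",
    "the power cable before servicing.",
    "Safety Instructions",            # a heading: no final period, but not broken
    "Wear gloves at all times.",
]
print(flag_broken_segments(segments))
```

A heuristic like this only produces a list of suspects. Headings, table cells, and list items still need a human decision, which is why the real pass is manual.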

I do not fix OCR errors in the source. Fortunately, the AI is surprisingly good at translating straight through light OCR mistakes by using surrounding context. But PDF formatting and OCR errors introduce random variations that hide duplicate segments and high-value fuzzy matches from standard CAT tool analysis. I use specialized techniques to hunt these down, lock them out, and ensure they are only translated once. This can drastically reduce your billable post-editing word count. Because this TM prep is highly time-consuming but generates significant financial savings for you, I split the difference by charging a separate fee equal to a portion of the savings generated by this effort.
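The idea of a "hidden" duplicate can be sketched in a few lines. The example below is not the specialized technique mentioned above. It is a generic illustration using Python's standard `difflib` module, and the `normalize` rules, the function names, and the 0.9 threshold are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(segment):
    """Remove noise that PDF conversion tends to add before comparing segments."""
    text = segment.replace("\u00ad", "")   # soft hyphens left over from line breaks
    text = re.sub(r"\s+", " ", text)       # runs of spaces, tabs, line breaks
    return text.strip().lower()


def find_hidden_matches(segments, threshold=0.9):
    """Return (i, j, ratio) for each segment j that nearly matches an earlier segment i."""
    normalized = [normalize(s) for s in segments]
    matches = []
    for j in range(len(normalized)):
        for i in range(j):
            ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= threshold:
                matches.append((i, j, round(ratio, 2)))
                break  # one earlier match is enough to lock segment j
    return matches


segments = [
    "Tighten the bolt to 25 Nm.",
    "Check the oil level daily.",
    "Tighten  the bo1t to 25 Nm.",   # OCR read "l" as "1", plus a doubled space
]
print(find_hidden_matches(segments))
```

An exact-match comparison would treat the first and third segments as two different sentences. After normalization and a similarity check, the third can be locked and filled from the first once it has been post-edited. The pairwise loop is quadratic, so a sketch like this suits small files only.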

PDFs place one final hurdle in front of us: because all of these files have to be pre-processed, the source files invariably arrive on a rolling basis rather than all at once at the beginning. This adds a small extra burden to the workflow, especially to the process of locking out hidden duplicates and internal fuzzy matches.

3. Post-processing (my responsibility)

Once the translator returns the post-edited files, I save everything to the TM. I run an automated pre-translation for the standard duplicates, and manually pre-translate the hidden duplicates and fuzzy matches I isolated earlier. Finally, I export the text back to a Word file and manually correct any remaining non-contiguous segment breaks.

I do not do any further manual formatting. This is the file I deliver back to you.

4. The final polish (your responsibility)

This final file requires one last pass by your DTP or conversion service, and potentially a final QA by your post-editors. Because all non-translatable text (like numbers, codes, and terms that remain in the source language) is locked out during post-editing, any original OCR errors hidden within that text will remain in the output. Your team must compare it against the source PDF and fix the final layout before delivering to the end client. At that point, you can deliver the Word files, or save them back to PDF and deliver those.

The bottom line

The four major PDF hurdles (formatting, segmentation, tags, and OCR) multiply the workload. They slow me down significantly, and you will also need DTP support at both the beginning and the end. Depending on the condition of the original PDF, my costs will generally be 2 to 3 times higher than the base rate for MT-processing a native source file. However, thanks to my deep TM preparation, your actual post-editing costs will likely drop below what the initial CAT-tool analysis suggests, giving you a possible windfall to partially offset the extra work.
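The arithmetic behind that trade-off can be laid out as a short sketch. Every number below is a made-up placeholder, not a quote. The only figure taken from the text is the 2 to 3 times multiplier. The share of savings charged as the TM prep fee is not stated anywhere above, so `savings_share` is a purely hypothetical parameter.

```python
def estimate_pdf_project(words, base_mt_rate, pdf_multiplier,
                         pe_rate, analysis_pe_words, actual_pe_words,
                         savings_share):
    """Compare a PDF-sourced project with the same project from native files.

    Rates are in cents per word. All inputs are hypothetical placeholders.
    """
    native_mt_cost = words * base_mt_rate
    pdf_mt_cost = native_mt_cost * pdf_multiplier        # 2 to 3 times the base rate
    extra_mt_cost = pdf_mt_cost - native_mt_cost
    pe_savings = (analysis_pe_words - actual_pe_words) * pe_rate
    tm_prep_fee = pe_savings * savings_share             # assumed share, not a stated figure
    return {
        "extra_mt_cost": extra_mt_cost,
        "pe_savings": pe_savings,
        "tm_prep_fee": tm_prep_fee,
        "net_extra_cost": extra_mt_cost + tm_prep_fee - pe_savings,
    }


estimate = estimate_pdf_project(
    words=10_000, base_mt_rate=1, pdf_multiplier=2.5,
    pe_rate=4, analysis_pe_words=8_000, actual_pe_words=6_000,
    savings_share=0.5,
)
print(estimate)
```

With these placeholder inputs, the post-editing savings cover part of the extra MT cost but not all of it, which matches the "partially offset" wording above. The sketch leaves out the DTP work at both ends, which is billed by your own vendors.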

Because every PDF is a unique challenge, let’s talk at the start of the project to map out the most efficient and cost-effective strategy.