jugad2 - Vasudev Ram on software innovation: Microsoft-Word

Wednesday, October 2, 2013

Convert Microsoft Word files to PDF with DOCXtoPDF


Building upon my recent post, "Extract text from Word .docx files with python-docx", I came up with the idea of combining the DOCX text extraction functionality of python-docx with my xtopdf toolkit, to create a program that converts the text in Microsoft Word DOCX files to PDF format.

[ Note: The conversion has some limitations. Only the text is converted, so fonts, tables, etc. from the input are not preserved in the output. ]

Here is the program, called DOCXtoPDF.py. It will become a part of my xtopdf toolkit.

# DOCXtoPDF.py

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

import sys
import os
import os.path
import string
from textwrap import TextWrapper
from docx import opendocx, getdocumenttext
from PDFWriter import PDFWriter

def docx_to_pdf(infilename, outfilename):

    # Extract the text from the DOCX file named infilename and write it
    # to a PDF file named outfilename.

    try:
        infil = opendocx(infilename)
    except Exception, e:
        print "Error opening " + infilename
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    paragraphs = getdocumenttext(infil)

    pw = PDFWriter(outfilename)
    pw.setFont("Courier", 12)
    pw.setHeader("DOCXtoPDF - convert text in DOCX file to PDF")
    pw.setFooter("Generated by xtopdf and python-docx")
    wrapper = TextWrapper(width=70, drop_whitespace=False)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    for paragraph in new_paragraphs:
        lines = wrapper.wrap(paragraph)
        for line in lines:
            pw.writeLine(line)
        pw.writeLine("")

    pw.savePage()
    pw.close()
    
def usage():

    return "Usage: python DOCXtoPDF.py infile.docx outfile.pdf\n"

def main():

    try:
        # Check for correct number of command-line arguments.
        if len(sys.argv) != 3:
            print "Wrong number of arguments"
            print usage()
            sys.exit(1)
        infilename = sys.argv[1]
        outfilename = sys.argv[2]

        # Check for right infilename extension.
        infile_ext = os.path.splitext(infilename)[1]
        if infile_ext.upper() != ".DOCX":
            print "Input filename extension should be .DOCX"
            print usage()
            sys.exit(1)

        # Check for right outfilename extension.
        outfile_ext = os.path.splitext(outfilename)[1]
        if outfile_ext.upper() != ".PDF":
            print "Output filename extension should be .PDF"
            print usage()
            sys.exit(1)

        docx_to_pdf(infilename, outfilename)

    except Exception, e:
        sys.stderr.write("Error: " + repr(e) + "\n")
        sys.exit(1)

if __name__ == '__main__':
    main()

# EOF

To run DOCXtoPDF, give a command of the form:

python DOCXtoPDF.py infilename.docx outfilename.pdf

After this, the text content of the DOCX file will be in the PDF file.
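
To see how the paragraphs get broken into lines before they reach the PDF, here is a small standalone sketch of just the wrapping step. It uses only textwrap from the standard library, so it needs neither python-docx nor xtopdf. The helper name wrap_paragraphs and the sample paragraphs are my own, made up for illustration; they are not part of DOCXtoPDF.py. It mirrors the loop in docx_to_pdf(), except that it collects the lines in a list instead of calling pw.writeLine().

```python
from textwrap import TextWrapper

def wrap_paragraphs(paragraphs, width=70):
    # Hypothetical helper, for illustration only.
    # Wrap each paragraph to at most `width` characters per line,
    # and add an empty line after each paragraph, the same way
    # the loop in docx_to_pdf() does before writing to the PDF.
    wrapper = TextWrapper(width=width, drop_whitespace=False)
    lines = []
    for paragraph in paragraphs:
        lines.extend(wrapper.wrap(paragraph))
        lines.append("")
    return lines

if __name__ == "__main__":
    # Made-up sample text standing in for paragraphs from a DOCX file.
    sample = ["word " * 30, "A short paragraph."]
    for line in wrap_paragraphs(sample):
        print(line)
```

The width of 70 is the same value DOCXtoPDF.py passes to TextWrapper; it suits the 12-point Courier font set there, but you may need a different width if you change the font or its size.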

- Enjoy.

Read other posts about xtopdf on this blog.
Read other posts about Python on this blog.

- Vasudev Ram - Dancing Bison Enterprises
