By Vasudev Ram
DOCX to PDF
Following on from "Extract text from Word .docx files with python-docx", I came up with the idea of combining the DOCX text extraction functionality of python-docx with my xtopdf toolkit, to create a program that converts the text in Microsoft Word DOCX files to PDF format.
[ Note: The conversion has some limitations. Only the text is converted; fonts, tables and other formatting from the input are not preserved in the output. ]
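To see why only the text survives, it helps to look at what text extraction from a DOCX file involves. A DOCX file is a ZIP archive, and the main body lives in word/document.xml, where each paragraph is a w:p element and the text sits in w:t elements. The sketch below is not python-docx's code; it is a minimal standard-library illustration written for Python 3, and the helper name extract_paragraph_text is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraph_text(docx_file):
    """Return a list of paragraph strings from a DOCX path or file object.

    Hypothetical helper for illustration; not part of python-docx or xtopdf.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each w:p element is a paragraph; its text lives in w:t elements.
    # Run properties (w:rPr), which carry fonts and styles, are ignored.
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Since the function keeps only the contents of w:t elements, anything stored elsewhere in the XML (font choices, table layout and so on) never reaches the output. Paragraphs inside table cells are returned as ordinary paragraphs, with the table structure lost.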
Here is the program, called DOCXtoPDF.py. It will become a part of my xtopdf toolkit.
# DOCXtoPDF.py

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

import sys
import os
import os.path
import string
from textwrap import TextWrapper
from docx import opendocx, getdocumenttext
from PDFWriter import PDFWriter

def docx_to_pdf(infilename, outfilename):

    # Extract the text from the DOCX file object infile and write it to
    # a PDF file.

    try:
        infil = opendocx(infilename)
    except Exception, e:
        print "Error opening " + infilename
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    paragraphs = getdocumenttext(infil)

    pw = PDFWriter(outfilename)
    pw.setFont("Courier", 12)
    pw.setHeader("DOCXtoPDF - convert text in DOCX file to PDF")
    pw.setFooter("Generated by xtopdf and python-docx")
    wrapper = TextWrapper(width=70, drop_whitespace=False)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    # Wrap each paragraph to 70 columns and write it out,
    # followed by a blank line.
    for paragraph in new_paragraphs:
        lines = wrapper.wrap(paragraph)
        for line in lines:
            pw.writeLine(line)
        pw.writeLine("")

    pw.savePage()
    pw.close()

def usage():

    return "Usage: python DOCXtoPDF.py infile.docx outfile.pdf\n"

def main():

    try:
        # Check for correct number of command-line arguments.
        if len(sys.argv) != 3:
            print "Wrong number of arguments"
            print usage()
            sys.exit(1)
        infilename = sys.argv[1]
        outfilename = sys.argv[2]

        # Check for right infilename extension.
        infile_ext = os.path.splitext(infilename)[1]
        if infile_ext.upper() != ".DOCX":
            print "Input filename extension should be .DOCX"
            print usage()
            sys.exit(1)

        # Check for right outfilename extension.
        outfile_ext = os.path.splitext(outfilename)[1]
        if outfile_ext.upper() != ".PDF":
            print "Output filename extension should be .PDF"
            print usage()
            sys.exit(1)

        docx_to_pdf(infilename, outfilename)

    except Exception, e:
        sys.stderr.write("Error: " + repr(e) + "\n")
        sys.exit(1)

if __name__ == '__main__':
    main()

# EOF
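The wrapping step in docx_to_pdf can be tried on its own, without python-docx or xtopdf installed. Here is a small sketch of it, written for Python 3 and using only the standard library. The helper name wrap_paragraphs is made up for this example; it returns the wrapped lines instead of passing them to PDFWriter.writeLine.

```python
from textwrap import TextWrapper


def wrap_paragraphs(paragraphs, width=70):
    """Wrap each paragraph to `width` columns, with a blank line after each.

    Hypothetical helper mirroring the loop in docx_to_pdf; it collects the
    lines in a list rather than writing them to a PDF.
    """
    # Same wrapper settings as in DOCXtoPDF.py.
    wrapper = TextWrapper(width=width, drop_whitespace=False)
    lines = []
    for paragraph in paragraphs:
        lines.extend(wrapper.wrap(paragraph))
        lines.append("")  # blank line between paragraphs
    return lines
```

Calling wrap_paragraphs(["A short paragraph.", "word " * 30]) and printing each returned line shows roughly what ends up on the PDF page: no line is longer than 70 characters, and each paragraph is followed by an empty line.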
To run DOCXtoPDF, give a command of the form:
python DOCXtoPDF.py infilename.docx outfilename.pdf
After this, the text content of the DOCX file will be in the PDF file.
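If you have several DOCX files to convert, the output name for each can be derived from the input name. The following is only a sketch, written for Python 3: pdf_name_for and build_command are hypothetical helpers, not part of xtopdf, and they apply the same case-insensitive extension check that DOCXtoPDF.py does.

```python
import os.path


def pdf_name_for(docx_path):
    """Return docx_path with its extension replaced by .pdf.

    Hypothetical helper; raises ValueError unless the input ends in .docx
    (checked case-insensitively, as DOCXtoPDF.py does).
    """
    base, ext = os.path.splitext(docx_path)
    if ext.upper() != ".DOCX":
        raise ValueError("Input filename extension should be .DOCX")
    return base + ".pdf"


def build_command(docx_path):
    """Build the argument list for running DOCXtoPDF.py on one file."""
    return ["python", "DOCXtoPDF.py", docx_path, pdf_name_for(docx_path)]
```

The list returned by build_command can be passed to subprocess.call, or joined with spaces and pasted at the command line.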
- Enjoy.
- Vasudev Ram - Dancing Bison Enterprises