MaxStocker.com

Extracting DOCX content with Java
Posted : Tuesday May 19th, 2009

I'm trying to read DOCX (Word 2007 file format) files in Java. My goal is just to index the content, so I only need to extract the text, without any formatting concerns. I looked around a bit, but Apache POI doesn't have DOCX support yet (it's coming soon and, in theory, available in some sort of pre-alpha preview), and the one other tool I found has a broken site.

But I figured it's basically XML, so how hard could it be? And the answer is that if you just want to extract the raw textual content, it's pretty simple.

If you don't already know, a DOCX file is really a zip archive. The zip contains a number of files, but the key one for content is word/document.xml. Then it's simply a matter of parsing that XML document and reading all the text tags (the w:t elements).
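To see the zip structure for yourself, here is a minimal sketch that lists every entry in a DOCX file. The class name DocxEntryLister and the method listEntries are made up for this illustration, and the try-with-resources syntax assumes Java 7 or later.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of any library: lists what is inside a DOCX.
public class DocxEntryLister {

  // Returns the names of all entries in the archive at the given path.
  public static List<String> listEntries(String path) throws IOException {
    List<String> names = new ArrayList<String>();
    // A DOCX opens like any other zip archive.
    try (ZipFile zip = new ZipFile(path)) {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        names.add(entries.nextElement().getName());
      }
    }
    return names;
  }

  public static void main(String[] args) throws IOException {
    for (String name : listEntries(args[0])) {
      System.out.println(name);
    }
  }
}
```

Run it against any DOCX and word/document.xml should show up in the list, alongside whatever other parts the document happens to contain.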

The listing at the end of this post shows how to do exactly that. Again, no formatting data is preserved, and I should note that this will not pick up content from headers and footers. Picking up header and footer content is a bit more complex because it lives in different XML files, which may or may not exist depending on whether the original document actually has headers and footers.
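For the header and footer case, a rough sketch might look like the following. It uses the same zip-plus-DOM approach as the main listing further down. It assumes the header and footer parts are named word/header1.xml, word/footer1.xml and so on, which is the usual naming but is not something this post verifies; a stricter version would look the part names up in the document's relationship file. The class name DocxHeaderFooterExtractor and its methods are invented for this example, and it assumes Java 7 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: collects the text found in header and footer parts.
public class DocxHeaderFooterExtractor {

  // Assumption: header/footer parts follow the word/header<N>.xml naming.
  static boolean isHeaderOrFooter(String name) {
    return name.matches("word/(header|footer)\\d*\\.xml");
  }

  public static List<String> extract(String path)
      throws IOException, ParserConfigurationException, SAXException {
    List<String> content = new ArrayList<String>();
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    try (ZipFile zip = new ZipFile(path)) {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        if (!isHeaderOrFooter(entry.getName())) {
          continue; // skip the body and every other part
        }
        try (InputStream in = zip.getInputStream(entry)) {
          Document doc = builder.parse(in);
          NodeList runs = doc.getElementsByTagName("w:t");
          for (int i = 0; i < runs.getLength(); i++) {
            content.add(runs.item(i).getTextContent());
          }
        }
      }
    }
    return content;
  }

  public static void main(String[] args) throws Exception {
    for (String s : extract(args[0])) {
      System.out.println(s);
    }
  }
}
```

A document without headers or footers simply has no matching entries, so the method returns an empty list rather than failing.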

Hope this helps someone else get started with whatever they need to do.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class DocxExtractor {
  public static void main(String args[]){
    ZipFile docxfile = null;
    try{
      docxfile = new ZipFile(args[0]);
    }catch(Exception e){
      // file corrupt or otherwise could not be found
      e.printStackTrace();
      return;
    }
    InputStream in = null;
    try{
      ZipEntry ze =
       docxfile.getEntry("word/document.xml");
      // getEntry returns null when the archive has no such entry
      if(ze == null){
        System.err.println("Expected entry word/document.xml does not exist");
        docxfile.close();
        return;
      }
      in =
       docxfile.getInputStream(ze);
    }catch(IOException ioe){
      ioe.printStackTrace();
      return;
    }
    Document document = null;
    try{
      DocumentBuilderFactory factory =
      DocumentBuilderFactory.newInstance();
      DocumentBuilder builder =
      factory.newDocumentBuilder();
      document = builder.parse(in);
    }catch(ParserConfigurationException pce){
      pce.printStackTrace();
      return;
    }catch(SAXException sex){
      sex.printStackTrace();
      return;
    }catch(IOException ioe){
      ioe.printStackTrace();
      return;
    }finally{
      try{
        docxfile.close();
      }catch(IOException ioe){
        System.err.println("Exception closing file.");
        ioe.printStackTrace();
      }
    }
    NodeList list =
    document.getElementsByTagName("w:t");
    List<String> content = new ArrayList<String>();
    for(int i=0;i<list.getLength();i++){
      Node aNode = list.item(i);
      // getTextContent() also copes with an empty w:t element
      content.add(aNode.getTextContent());
    }
    for(String s : content){
      System.out.println(s);
    }
  }
}
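One thing to be aware of: Word often splits a single paragraph into several w:t runs, so the listing above can print one paragraph across multiple lines. If that matters for indexing, a possible variation is to group the runs by their enclosing w:p element. The sketch below is only an illustration; the class name DocxParagraphExtractor and the method paragraphs are made up, and it takes the already-opened word/document.xml stream as input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: returns one string per w:p paragraph.
public class DocxParagraphExtractor {

  public static List<String> paragraphs(InputStream documentXml)
      throws IOException, ParserConfigurationException, SAXException {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(documentXml);
    List<String> result = new ArrayList<String>();
    NodeList paras = doc.getElementsByTagName("w:p");
    for (int i = 0; i < paras.getLength(); i++) {
      Element para = (Element) paras.item(i);
      // Join every text run inside this paragraph into one string.
      NodeList runs = para.getElementsByTagName("w:t");
      StringBuilder text = new StringBuilder();
      for (int j = 0; j < runs.getLength(); j++) {
        text.append(runs.item(j).getTextContent());
      }
      result.add(text.toString());
    }
    return result;
  }
}
```

To use it, pass in the stream you get from docxfile.getInputStream(ze) in the main listing. Empty paragraphs come back as empty strings, and content that nests paragraphs inside a paragraph (text boxes, as far as I can tell) may show up more than once, so treat this as a starting point.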

Tags

content  dox  extraction  file  format  xml 

Categories

Technical  Java 

Comments


sgw555 - May 21st 2009 8:02 PM
 
Will this soon be an application that normal people can use? Like what you did for that ZIP extraction thingy?

Because not everyone gets a kick out of parsing or compiling code...


Max - May 22nd 2009 9:07 AM
 
Well, to be honest, in its present state it's not too useful for anyone but programmers. If you are looking for how to extract text from a DOCX, then it's useful; otherwise, not so much.

I am working, though, on a doc-to-PDF engine. I might wrap that up in a GUI, but a lot of desktop systems already have capabilities for doing this. I am doing it so I can do it on a server without having to open Word (or equivalent) and print/save to PDF.


nomias O. madr - Jul 23rd 2009 2:04 PM
 
Thanks in advance dude!!

It works great inside my application.
I was looking for something like that for a long time.

Regards


Eduardo - Jan 28th 2010 3:24 PM
 
You may find javadocx interesting. It has an LGPL version that you can download from http://www.javadocx.com


 
   
   
  © 2008 Max Stocker