Extracting DOCX content with Java
I'm trying to read DOCX (Word 2007 file format) files in Java. My goal is just to index the content, so I only need to extract the text, without any formatting concerns. I looked around a bit, but Apache POI doesn't have DOCX support yet (it's coming soon and in theory available in some sort of pre-alpha preview), and the one other tool I found has a site that's broken. But I figured it's basically XML, so how hard could it be?

The answer is that if you only want the raw textual content, it's pretty simple. If you don't already know, a DOCX file is really a zip archive. The zip contains a number of files, but the key one for content is word/document.xml. Then it's simply a matter of parsing that XML document and reading all the text elements (the w:t tags).

So below is some code that shows how to do exactly that. Again, no formatting data is preserved, and I should note that this will not pick up content from headers and footers. Picking up header and footer content is a bit more complex because it lives in separate XML files, which may or may not exist depending on whether the original document actually has headers and footers. Hope this helps someone else get started with whatever they need too.
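As a quick aside before the extractor itself: since a DOCX is just a zip archive, its layout can be inspected with the standard `java.util.zip` classes. The following is a minimal sketch, not part of the original extractor; the class name `DocxEntryLister` and the method `listEntries` are made up for illustration. Running it against a DOCX should show `word/document.xml` alongside the other parts.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not from the original post): lists the parts inside a DOCX.
public class DocxEntryLister {

    // Returns the name of every entry in the zip archive at the given path.
    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the archive even if reading fails
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java DocxEntryLister <file.docx>");
            return;
        }
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

This is handy for checking whether a given document has any header or footer parts at all before trying to read them.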
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.SAXException;

    public class DocxExtractor {
        public static void main(String args[]){
            // A DOCX file is a zip archive, so open it as one
            ZipFile docxfile = null;
            try{
                docxfile = new ZipFile(args[0]);
            }catch(Exception e){
                // file corrupt or otherwise could not be found
                e.printStackTrace();
                return;
            }

            // The main body of the document lives in word/document.xml
            InputStream in = null;
            try{
                ZipEntry ze = docxfile.getEntry("word/document.xml");
                if(ze == null){
                    System.err.println("Expected entry word/document.xml does not exist");
                    return;
                }
                in = docxfile.getInputStream(ze);
            }catch(IOException ioe){
                ioe.printStackTrace();
                return;
            }

            // Parse the XML into a DOM document
            Document document = null;
            try{
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                DocumentBuilder builder = factory.newDocumentBuilder();
                document = builder.parse(in);
            }catch(ParserConfigurationException pce){
                pce.printStackTrace();
                return;
            }catch(SAXException sex){
                sex.printStackTrace();
                return;
            }catch(IOException ioe){
                ioe.printStackTrace();
                return;
            }finally{
                try{
                    docxfile.close();
                }catch(IOException ioe){
                    System.err.println("Exception closing file.");
                    ioe.printStackTrace();
                }
            }

            // Text runs are stored in w:t elements; collect the text of each one
            NodeList list = document.getElementsByTagName("w:t");
            List<String> content = new ArrayList<String>();
            for(int i=0;i<list.getLength();i++){
                Node aNode = list.item(i);
                // an empty w:t element has no child text node
                if(aNode.getFirstChild() != null){
                    content.add(aNode.getFirstChild().getNodeValue());
                }
            }

            for(String s : content){
                System.out.println(s);
            }
        }
    }
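The post notes that header and footer content lives in separate XML files that may or may not exist. A minimal sketch of one way to pick them up follows. It assumes the parts use the conventional names Word writes (`word/header1.xml`, `word/footer1.xml`, and so on); strictly speaking the real part names are declared in the archive's relationships file, so treat the name matching here as a shortcut. The class name `DocxHeaderFooterExtractor` and its methods are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper (not from the original post): reads header/footer text.
public class DocxHeaderFooterExtractor {

    // Assumption: header and footer parts are named word/headerN.xml and
    // word/footerN.xml. Parts with other names would be missed.
    static boolean isHeaderOrFooter(String entryName) {
        return entryName.matches("word/(header|footer)\\d*\\.xml");
    }

    // Collects the text of every w:t element found in header and footer parts.
    static List<String> extractText(String path)
            throws IOException, ParserConfigurationException, SAXException {
        List<String> content = new ArrayList<>();
        // The default factory is not namespace-aware, so "w:t" matches the
        // qualified tag name, same as in the extractor for document.xml.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!isHeaderOrFooter(entry.getName())) {
                    continue; // skip document.xml, styles, and everything else
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    Document doc = builder.parse(in);
                    NodeList runs = doc.getElementsByTagName("w:t");
                    for (int i = 0; i < runs.getLength(); i++) {
                        // getTextContent returns "" for an empty element
                        content.add(runs.item(i).getTextContent());
                    }
                }
            }
        }
        return content;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java DocxHeaderFooterExtractor <file.docx>");
            return;
        }
        for (String s : extractText(args[0])) {
            System.out.println(s);
        }
    }
}
```

If the document has no headers or footers, no entries match and the result is simply an empty list, which covers the "may or may not exist" case without special handling.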
© 2008 Max Stocker