Java XML Parsing

XML Prepared By Srinivasan Jayakumar

Briefly: The Power of XML
XML is the Extensible Markup Language, a text-based representation for describing data structure that is both human and machine readable. It originated from the Standard Generalized Markup Language (SGML) and became a World Wide Web Consortium (W3C) standard in 1998. XML is a great choice for exchanging data between disparate systems.

Synergy between Java and XML
Java + XML = portable language + portable data. Java can be used to generate XML data, to access SQL databases and format the results in XML, and to parse, validate, and transform XML data.

HTML and XML
HTML and XML look similar because they are both SGML languages: both use elements enclosed in tags (e.g. <body>This is an element</body>) and both use tag attributes (e.g., ). More precisely, HTML is defined in SGML, while XML is a (very small) subset of SGML.

HTML and XML
HTML is for humans: it describes web pages, and browsers ignore and/or correct many HTML errors, so HTML is often sloppy. XML is for computers: it describes data, the rules are strict, and errors are not allowed. In this way, XML is like a programming language. Current versions of most browsers can display XML.

Example XML document
<?xml version="1.0"?>
<weatherReport>
  <date>7/14/97</date>
  <city>North Place</city>, <state>NX</state> <country>USA</country>
  High Temp: <high scale="F">103</high>
  Low Temp: <low scale="F">70</low>
  Morning: <morning>Partly cloudy, Hazy</morning>
  Afternoon: <afternoon>Sunny &amp; hot</afternoon>
  Evening: <evening>Clear and Cooler</evening>
</weatherReport>

Overall structure
An XML document may start with one or more processing instructions or directives:
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/css" href="ss.css"?>
Following the directives, there must be exactly one root element containing all the rest of the XML:
  <weatherReport> ... </weatherReport>

XML building blocks
Aside from the directives, an XML document is built from:
  elements: high in <high scale="F">103</high>
  tags, in pairs: <high scale="F"> and </high>
  attributes: scale="F" in <high scale="F">103</high>
  entities: &amp; in <afternoon>Sunny &amp; hot</afternoon>
  data: 103 in <high scale="F">103</high>

Elements and attributes
Attributes and elements are often interchangeable. Example:
  <name> <first>David</first> <last>Smith</last> </name>
  <name first="David" last="Smith"> </name>
Elements are easier to use from Java. Attributes may contain elaborate metadata, such as unique IDs.

Well-formed XML
In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name>. Empty elements can be abbreviated: <break />. XML tags are case sensitive, and names may not begin with the letters xml in any combination of cases. Elements must be properly nested: tags may not overlap, so an element opened inside another must also be closed inside it. An XML document must have one and only one root element. The values of attributes must be enclosed in quotes, e.g. <time unit="days">.
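The well-formedness rules above can be checked mechanically by handing text to a parser. The following is a minimal sketch rather than part of the original slides: the class name WellFormedCheck and the helper isWellFormed are hypothetical, and the sketch assumes the JAXP parser that ships with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class WellFormedCheck {

    // Returns true if the given text parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name><first>David</first></name>")); // true
        System.out.println(isWellFormed("<b><i>text</b></i>"));  // false: tags overlap
        System.out.println(isWellFormed("<a/><b/>"));            // false: two root elements
        System.out.println(isWellFormed("<time unit=days/>"));   // false: unquoted attribute
    }
}
```

Each rejected input violates exactly one of the rules listed on the slide, which makes the sketch a convenient way to experiment with them.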

XML as a tree
An XML document represents a hierarchy, and a hierarchy is a tree:
  novel
    foreword
      paragraph: This is the great American novel.
    chapter number="1"
      paragraph: It was a dark and stormy night.
      paragraph: Suddenly, a shot rang out!

Viewing XML
XML is designed to be processed by computer programs, not to be displayed to humans. Nevertheless, almost all current Web browsers can display XML documents, although they do not all display it the same way and may not display it at all if it has errors. This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used.

Stream Model
The stream seen by the parser is a sequence of elements. As each XML element is seen, an event occurs and some code registered with the parser (the event handler) is executed. This approach is popularized by the Simple API for XML (SAX). Problem: it is hard to get a global view of the document, so parsing state has to be represented by global variables set by the event handlers.

Data Model
The XML data is transformed into a navigable data structure in memory. Because of the nesting of XML elements, a tree data structure is used, and the tree is navigated to discover the XML document. This approach is popularized by the Document Object Model (DOM). Problem: it may require large amounts of memory and may not be as fast as the stream approach. Some DOM parsers use SAX to build the tree.

SAX and DOM
SAX and DOM are standards for XML parsers. DOM is a W3C standard; SAX is an ad-hoc (but very popular) standard. There are various implementations available. Java implementations are provided as part of JAXP (Java API for XML Processing), which is included in the JDK starting from JDK 1.4 and is available separately for Java 1.3.

Difference between SAX and DOM
DOM reads the entire document into memory and stores it as a tree data structure. SAX reads the document and calls handler methods for each element or block of text that it encounters. Consequences: DOM provides "random access" into the document, whereas SAX provides only sequential access. DOM is comparatively slow and needs memory that grows with the document, so it is a poor fit for very large documents. SAX is fast and requires very little memory, so it can be used for huge documents, which makes it attractive for web sites.
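The contrast between sequential and random access can be made concrete with one small task done both ways. This is an illustrative sketch only: the class SaxVersusDom and its three methods are hypothetical names, and input is taken from a string so the example stays self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison class, for illustration only.
public class SaxVersusDom {

    // SAX: elements stream past once; only a counter is kept in memory.
    static int countWithSax(String xml, String tag) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    // DOM: the whole tree is built first, then queried.
    static int countWithDom(String xml, String tag) throws Exception {
        return parse(xml).getElementsByTagName(tag).getLength();
    }

    // DOM random access: jump straight to the n-th matching element (0-based).
    static String textOfNth(String xml, String tag, int n) throws Exception {
        return parse(xml).getElementsByTagName(tag).item(n).getTextContent();
    }

    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

Both counting methods give the same answer; the difference is that the SAX version never holds more than a counter, while the DOM version can answer follow-up questions such as textOfNth without re-reading the input.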

Parsing with SAX
SAX uses the source-listener-delegate model for parsing XML documents. The source is XML data consisting of XML elements. A listener written in Java is attached to the document and listens for events. When an event is fired, the listener delegates to a method that handles it.

SAX Parsing: process XML as Stream

Simple SAX program
The program consists of two classes.
Sample: this class contains the main method; it
  gets a factory to make parsers
  gets a parser from the factory
  creates a Handler object to handle callbacks from the parser
  tells the parser which handler to send its callbacks to
  reads and parses the input XML file
Handler: this class contains handlers for three kinds of callbacks:
  startElement callbacks, generated when a start tag is seen
  endElement callbacks, generated when an end tag is seen
  characters callbacks, generated for the contents of an element

The Sample class
import javax.xml.parsers.*; // for both SAX and DOM
import org.xml.sax.*;
import org.xml.sax.helpers.*;
// For simplicity, we let the operating system handle exceptions
// In "real life" this is poor programming practice
public class Sample {
    public static void main(String args[]) throws Exception {
        // Create a parser factory
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Tell factory that the parser must understand namespaces
        factory.setNamespaceAware(true);
        // Make the parser
        SAXParser saxParser = factory.newSAXParser();
        XMLReader parser = saxParser.getXMLReader();

The Sample class
        // Create a handler
        Handler handler = new Handler();
        // Tell the parser to use this handler
        parser.setContentHandler(handler);
        // Finally, read and parse the document
        parser.parse("hello.xml");
    } // end of main
} // end of Sample class
The parser reads the file hello.xml. The argument is a system identifier (a URI); a relative name such as this one is normally resolved against the current working directory rather than the classpath, so run the program from the directory that contains the file.

The Handler class
DefaultHandler is an adapter class that defines empty methods to be overridden. We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags. The methods will just print a line. Each of these 3 methods is declared to throw a SAXException.
public class Handler extends DefaultHandler {
    // SAX calls this when it encounters a start tag
    public void startElement(String namespaceURI, String localName,
                             String qualifiedName, Attributes attributes)
            throws SAXException {
        System.out.println("startElement: " + qualifiedName);
    }

The Handler class
    // SAX calls this method to pass in character data
    public void characters(char ch[], int start, int length)
            throws SAXException {
        System.out.println("characters: \"" + new String(ch, start, length) + "\"");
    }
    // SAX calls this method when it encounters an end tag
    public void endElement(String namespaceURI, String localName,
                           String qualifiedName)
            throws SAXException {
        System.out.println("endElement: /" + qualifiedName);
    }
} // End of Handler class

Results
If the file hello.xml contains:
  <?xml version="1.0"?>
  <display>Hello World!</display>
then the output from running java Sample will be:
  startElement: display
  characters: "Hello World!"
  endElement: /display

More results
Now suppose the file hello.xml contains:
  <?xml version="1.0"?>
  <display>
    <i>Hello</i> World!
  </display>
Notice that the root element, <display>, contains a nested element and whitespace (including newlines). The result will look like this (a newline inside the quotes is shown as \n; how the text is split across characters callbacks depends on the parser):
  startElement: display
  characters: ""          // empty string
  characters: "\n"        // newline
  characters: "    "      // spaces
  startElement: i
  characters: "Hello"
  endElement: /i
  characters: " World!"
  characters: "\n"        // another newline
  endElement: /display
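Because text may arrive in several characters callbacks, a handler that needs the text of an element usually accumulates it and looks at it only when the end tag arrives. The sketch below shows one way to do that; the class TextCollector and its collect method are hypothetical, and it deliberately discards any text that appears before a nested start tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
public class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        buffer.setLength(0); // start fresh; earlier pending text is dropped
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called many times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            texts.add(qName + "=" + text);
        }
        buffer.setLength(0);
    }

    // Parses the XML text and returns "element=text" entries in document order.
    static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<display>\n  <i>Hello</i> World!\n</display>";
        System.out.println(collect(xml)); // [i=Hello, display=World!]
    }
}
```

The result does not depend on how the parser chunks the text, which is the point of buffering.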

Factories
SAX uses a parser factory. A factory is a design pattern that serves as an alternative to constructors. Factories allow the programmer to decide whether or not to create a new object, and to decide what kind of object to create.
class TrustMe {
    private TrustMe() { } // private constructor
    // Factory method; it must be static because the constructor is private
    public static TrustMe makeTrust() {
        if ( /* test of some sort */ true )
            return new TrustMe();
        return null; // the factory decided not to create an object
    }
}

Parser factories
To create a SAX parser factory, call the static method SAXParserFactory.newInstance(). It returns an object of type SAXParserFactory and may throw a FactoryConfigurationError. The factory can then be customized before any parser is created:
  public void setNamespaceAware(boolean awareness)
    Call this with true if you are using namespaces. The default (if you don’t call this method) is false.
  public void setValidating(boolean validating)
    Call this with true if you want to validate against a DTD. The default (if you don’t call this method) is false. Validation will give an error if you do not have a DTD.
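The effect of setNamespaceAware can be observed directly in the arguments that startElement receives. The following sketch is an illustration under stated assumptions: FactoryConfig and describeRoot are hypothetical names, and the namespace URI urn:example is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
public class FactoryConfig {

    // Returns "namespaceURI|localName" as reported for the root element.
    static String describeRoot(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // default is false
        factory.setValidating(false);              // default, shown for clarity
        String[] seen = {null};
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (seen[0] == null) {
                            seen[0] = uri + "|" + localName;
                        }
                    }
                });
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p:item xmlns:p=\"urn:example\"/>";
        System.out.println(describeRoot(xml, true)); // urn:example|item
    }
}
```

With namespace awareness switched off, the URI and local name are typically reported as empty strings, so only the qualified name is useful in that mode.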

Getting a parser
Once a SAXParserFactory has been set up, parsers can be created with:
  SAXParser saxParser = factory.newSAXParser();
  XMLReader parser = saxParser.getXMLReader();
Note: SAXParser is not thread-safe. If parsing will be done in multiple threads, create a separate SAXParser object for each thread.

Declaring which handler to use
Since the SAX parser will call the handler methods, we need to supply them. Binding the parser to a handler:
  Handler handler = new Handler();
  parser.setContentHandler(handler);
These statements could be combined:
  parser.setContentHandler(new Handler());
Finally, the parser is invoked on the file to parse:
  parser.parse("hello.xml");
Everything else is done in the handler methods.

SAX handlers
A complete callback handler implements 4 interfaces:
  interface ContentHandler: handles basic parsing callbacks, e.g., element starts and ends
  interface DTDHandler: handles only notation and unparsed entity declarations
  interface EntityResolver: does customized handling for external entities
  interface ErrorHandler: must be implemented and registered, or most parsing errors will be ignored!
Implementing all these interfaces is a lot of work. It is easier to use an adapter class.

Class DefaultHandler
DefaultHandler is an adapter class in package org.xml.sax.helpers. It implements ContentHandler, DTDHandler, EntityResolver, and ErrorHandler, and provides a default (mostly empty) implementation of every method declared in those interfaces. To use this class, extend it and override the methods that are important to the application.

ContentHandler methods
public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
This method is called at the beginning of each element. When SAX calls startElement, it passes in a parameter of type Attributes. The following methods of Attributes look up attributes by name rather than by index:
  public int getIndex(String qualifiedName)
  public int getIndex(String uri, String localName)
  public String getValue(String qualifiedName)
  public String getValue(String uri, String localName)
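As a small illustration of the Attributes lookups, the sketch below records the scale attribute of every element that has one, using the weather report from earlier as input. ScaleReader and read are hypothetical names; the parser is left at its default of not being namespace aware, so elements and attributes are identified by their qualified names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
public class ScaleReader extends DefaultHandler {
    // element name -> value of its "scale" attribute, in document order
    private final Map<String, String> scales = new LinkedHashMap<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qualifiedName, Attributes atts) {
        int index = atts.getIndex("scale"); // -1 when the attribute is absent
        if (index >= 0) {
            scales.put(qualifiedName, atts.getValue(index));
        }
    }

    static Map<String, String> read(String xml) throws Exception {
        ScaleReader handler = new ScaleReader();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.scales;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<weatherReport>"
                + "<high scale=\"F\">103</high><low scale=\"F\">70</low>"
                + "<city>North Place</city></weatherReport>";
        System.out.println(read(xml)); // {high=F, low=F}
    }
}
```

Calling atts.getValue("scale") directly would work as well; it returns null when the attribute is missing.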

ContentHandler methods
public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
The parameters to endElement are the same as those to startElement, except that the Attributes parameter is omitted.
public void characters(char[] ch, int start, int length) throws SAXException
ch is an array of characters. Only length characters, starting from ch[start], are the contents of the element; the rest of the array must not be read.

Error Handling
SAX error handling is unusual. Most errors are ignored unless an error handler (org.xml.sax.ErrorHandler) is registered, and ignored errors can cause unexpected behavior. The ErrorHandler interface declares:
  public void fatalError(SAXParseException exception) throws SAXException // XML not well structured
  public void error(SAXParseException exception) throws SAXException // XML validation error
  public void warning(SAXParseException exception) throws SAXException // minor problem
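A registered ErrorHandler might simply record what the parser reports, as in the sketch below. The class CollectingErrorHandler and its check method are hypothetical; the message format is likewise an arbitrary choice for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical error handler, for illustration only.
public class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning at line " + e.getLineNumber());
    }

    @Override
    public void error(SAXParseException e) {
        problems.add("error at line " + e.getLineNumber());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        problems.add("fatal at line " + e.getLineNumber());
        throw e; // parsing cannot usefully continue after a fatal error
    }

    // Parses the text and returns every problem that was reported.
    static List<String> check(String xml) throws Exception {
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // already recorded by fatalError
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check("<a><b/></a>")); // []
        System.out.println(check("<a><b></a>"));  // one fatal entry: <b> is never closed
    }
}
```

Validation errors reach the error method only when the factory has been configured with setValidating(true).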

External parsers
Alternatively, you can use an existing parser: Xerces, Electric XML, Expat, MSXML, CMarkup. Stages of the parsing:
  Get the URL object for the source
  Create an InputSource object encapsulating the data source
  Create the parser
  Launch the parser on the data source
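The four stages above can be sketched with the JDK's own JAXP parser standing in for an external one. This is an assumption made for the sake of a runnable example: UrlSourceDemo and rootName are hypothetical names, and the URL points at a temporary local file rather than a remote resource.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
public class UrlSourceDemo {

    // Parses the document at the URL and returns the name of its root element.
    static String rootName(URL url) throws Exception {
        try (InputStream in = url.openStream()) {
            // Stage 2: wrap the data source in an InputSource
            InputSource source = new InputSource(in);
            source.setSystemId(url.toString()); // lets relative references resolve
            // Stage 3: create the parser
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            String[] root = {null};
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
            // Stage 4: launch the parser on the data source
            reader.parse(source);
            return root[0];
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("hello", ".xml");
        Files.write(file,
                "<display>Hello World!</display>".getBytes(StandardCharsets.UTF_8));
        // Stage 1: get the URL object for the source
        URL url = file.toUri().toURL();
        System.out.println(rootName(url)); // display
        Files.delete(file);
    }
}
```

Setting the system identifier on the InputSource is optional here, but it is what allows the parser to resolve relative references such as an external DTD.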

Problems with SAX
SAX provides only sequential access to the document being processed, and it has only a local view of the current element. Global knowledge of the parsing must therefore be stored in global variables. There is a single startElement() method for all elements, so startElement() ends up containing many "if-then-else" tests that check for a specific element. When an element of interest is seen, a global flag is set; when the element is finished, the flag must be set back to false.
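The flag technique described above looks roughly like this in practice. The sketch is illustrative only: CityExtractor and extract are hypothetical names, and the flag is kept as a field of the handler rather than a true global variable.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
public class CityExtractor extends DefaultHandler {
    private boolean inCity; // flag: true while the parser is inside <city>
    private final StringBuilder city = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("city".equals(qName)) {
            inCity = true;  // element of interest seen: set the flag
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCity) {
            city.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("city".equals(qName)) {
            inCity = false; // element finished: clear the flag
        }
    }

    // Returns the text of the <city> element(s) in the document.
    static String extract(String xml) throws Exception {
        CityExtractor handler = new CityExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.city.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<weatherReport><date>7/14/97</date>"
                + "<city>North Place</city><state>NX</state></weatherReport>";
        System.out.println(extract(xml)); // North Place
    }
}
```

Every additional element of interest needs its own flag and its own pair of tests, which is why this style becomes awkward for documents with rich structure.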

DOM
DOM represents the XML document as a tree. The hierarchical nature of a tree maps well to the hierarchical nesting of XML elements. The tree contains a global view of the document, which makes navigation easy and allows any subtree to be modified. Processing is easier than with SAX, but memory intensive! Like SAX, DOM is only an API: it does not specify a parser, but lists the API and the requirements for the parser. DOM parsers typically use SAX parsing.

DOM Parsing: process entire document

Simple DOM program
First we need to create a DOM parser, called a DocumentBuilder. The parser is created not by a constructor but through a factory: the static method DocumentBuilderFactory.newInstance() returns a factory, which in turn makes the parser.
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = factory.newDocumentBuilder();

Simple DOM program
An XML file hello.xml will be parsed:
  <?xml version="1.0"?>
  <display>Hello World!</display>
To read this file, we add the following line:
  Document document = builder.parse("hello.xml");
document contains the entire XML file as a tree. The following code finds the content of the root element and prints it:
  Element root = document.getDocumentElement();
  Node textNode = root.getFirstChild();
  System.out.println(textNode.getNodeValue());
The output of the program is:
  Hello World!
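For experimenting without a file on disk, the same steps can be run against XML held in a string. The sketch below is a self-contained variant, not the original program: DomHello and its two methods are hypothetical names, and getTextContent() is shown as an alternative that gathers all text below an element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
public class DomHello {

    private static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement();
    }

    // Same approach as the slide: the value of the root's first child node.
    static String firstChildValue(String xml) throws Exception {
        Node textNode = parseRoot(xml).getFirstChild();
        return textNode.getNodeValue();
    }

    // Alternative: all text beneath the root, including nested elements.
    static String allText(String xml) throws Exception {
        return parseRoot(xml).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><display>Hello World!</display>";
        System.out.println(firstChildValue(xml)); // Hello World!
        System.out.println(allText(xml));         // Hello World!
    }
}
```

The two methods agree here, but they diverge as soon as the root contains nested elements or leading whitespace: the first child may then be a whitespace-only Text node.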

Reading in the tree
The parse method reads in the entire XML document and represents it as a tree in memory. For a large document, parsing could take a while; if you want to interact with your program while it is parsing, you need to run the parser in a separate thread. In practice, an XML parse tree may require up to 10 times as much memory as the original XML document. If you have a lot of tree manipulation to do, DOM is much more convenient than SAX. If you do not, consider using SAX instead.

Structure of the DOM tree
The DOM tree is composed of Node objects. Node is an interface; some of its more important sub-interfaces are Element, Attr, and Text. An Element node may have children, while Text nodes are leaves of the tree; Attr nodes are attached to their Element rather than appearing among its children. Node objects can be downcast into specific types if needed.

Operations on Nodes
The results returned by getNodeName(), getNodeValue(), getNodeType() and getAttributes() depend on the subtype of the node, as follows:
                   Element         Text             Attr
  getNodeName()    tag name        "#text"          name of attribute
  getNodeValue()   null            text contents    value of attribute
  getNodeType()    ELEMENT_NODE    TEXT_NODE        ATTRIBUTE_NODE
  getAttributes()  NamedNodeMap    null             null

Distinguishing Node types
An easy way to handle different types of nodes:
switch (node.getNodeType()) {
    case Node.ELEMENT_NODE:
        Element element = (Element) node;
        ...
        break;
    case Node.TEXT_NODE:
        Text text = (Text) node;
        ...
        break;
    case Node.ATTRIBUTE_NODE:
        Attr attr = (Attr) node;
        ...
        break;
    default:
        ...
}

Operations on Nodes
Tree-walking methods that return a Node:
  getParentNode()
  getFirstChild()
  getNextSibling()
  getPreviousSibling()
  getLastChild()
Test methods that return a boolean:
  hasAttributes()
  hasChildNodes()
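The tree-walking methods combine naturally with a switch on the node type to visit a whole document. The following is a minimal recursive sketch; TreeWalker, walk and outline are hypothetical names, and whitespace-only text nodes are skipped to keep the outline readable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
public class TreeWalker {

    // Appends one line per element or non-blank text node, indented by depth.
    static void walk(Node node, String indent, List<String> out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                out.add(indent + node.getNodeName());
                break;
            }
            case Node.TEXT_NODE: {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(indent + "\"" + text + "\"");
                }
                break;
            }
            default:
                break; // comments, processing instructions, etc. are ignored
        }
        // Attributes are not children, so this loop never visits them.
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, indent + "  ", out);
        }
    }

    static List<String> outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        walk(doc.getDocumentElement(), "", out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<novel><foreword><paragraph>This is the great American novel."
                + "</paragraph></foreword><chapter number=\"1\"/></novel>";
        for (String line : outline(xml)) {
            System.out.println(line);
        }
    }
}
```

Using getFirstChild() and getNextSibling() avoids building a NodeList for every element, although iterating over getChildNodes() would work equally well.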

Operations for Elements
  String getTagName(): returns the name of the tag
  boolean hasAttribute(String name): returns true if this Element has the named attribute
  String getAttribute(String name): returns the value of the named attribute
  boolean hasAttributes(): returns true if this Element has any attributes
  NamedNodeMap getAttributes(): returns a NamedNodeMap of all the Element’s attributes

Operations on Texts
Text is a subinterface of CharacterData and inherits the following operations (among others):
  public String getData() throws DOMException: returns the text contents of this Text node
  public int getLength(): returns the length of the text in 16-bit units (Java chars)
  public String substringData(int offset, int count) throws DOMException: returns a substring of the text contents

Operations on Attributes
  String getName(): returns the name of this attribute
  Element getOwnerElement(): returns the Element node this attribute is attached to
  String getValue(): returns the value of the attribute as a String
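The Element, Text and Attr operations listed on the last three slides can be tried together on a single element. This sketch is illustrative: ElementInspector and parseRoot are hypothetical names, and the input is the high element from the weather report.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
public class ElementInspector {

    // Parses the text and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element high = parseRoot("<high scale=\"F\">103</high>");

        // Element operations
        System.out.println(high.getTagName());                   // high
        System.out.println(high.hasAttribute("scale"));          // true
        System.out.println(high.getAttribute("scale"));          // F
        System.out.println(high.getAttribute("unit").isEmpty()); // true: missing gives ""

        // Attr operations
        Attr scale = high.getAttributeNode("scale");
        System.out.println(scale.getName() + "=" + scale.getValue()); // scale=F
        System.out.println(scale.getOwnerElement().isSameNode(high)); // true

        // Text operations (inherited from CharacterData)
        Text text = (Text) high.getFirstChild();
        System.out.println(text.getData());           // 103
        System.out.println(text.getLength());         // 3
        System.out.println(text.substringData(0, 2)); // 10
    }
}
```

Note that getAttribute returns an empty string, not null, for an attribute that is absent, so hasAttribute is the reliable way to test for presence.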

Overview
DOM, unlike SAX, allows you to create and modify XML trees. There are three basic kinds of operations: creating a new DOM, modifying the structure of a DOM, and modifying the content of a DOM. Creating a new DOM requires a few extra methods just to get started. Afterwards, you can add elements by modifying its structure and contents.

Creating a new DOM
import javax.xml.parsers.*;
import org.w3c.dom.Document;
…
try {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.newDocument();
} catch (ParserConfigurationException e) {
    ...
}

Creating structure
The following are instance methods of Document:
  public Element createElement(String tagName)
  public Element createElementNS(String namespaceURI, String qualifiedName)
  public Attr createAttribute(String name)
  public Attr createAttributeNS(String namespaceURI, String qualifiedName)
  public ProcessingInstruction createProcessingInstruction(String target, String data)
  public EntityReference createEntityReference(String name)
  public Text createTextNode(String data)
  public Comment createComment(String data)

Methods of Node
  public Node appendChild(Node newChild)
  public Node insertBefore(Node newChild, Node refChild)
  public Node removeChild(Node oldChild)
  public Node replaceChild(Node newChild, Node oldChild)
  public void setNodeValue(String nodeValue): functionality depends on the type of the node

Methods of Element
  public void setAttribute(String name, String value)
  public Attr setAttributeNode(Attr newAttr)
  public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
  public Attr setAttributeNodeNS(Attr newAttr)
  public void removeAttribute(String name)
  public void removeAttributeNS(String namespaceURI, String localName)
  public Attr removeAttributeNode(Attr oldAttr)

Method of Attribute
  public void setValue(String value)
This is the only method that modifies an Attribute. The rest just retrieve information.
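Putting the creation and modification methods together, a small document can be built from scratch and then written out as text. The sketch below is hedged accordingly: DomBuilderDemo and buildReport are hypothetical names, and serialization through javax.xml.transform is an addition that the slides do not cover.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, for illustration only.
public class DomBuilderDemo {

    static String buildReport() throws Exception {
        // Creating a new DOM
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // Modifying the structure: add a root and one child element
        Element report = doc.createElement("weatherReport");
        doc.appendChild(report);
        Element high = doc.createElement("high");
        high.setAttribute("scale", "C");
        high.appendChild(doc.createTextNode("103"));
        report.appendChild(high);

        // Modifying the content: change the attribute through its Attr node
        high.getAttributeNode("scale").setValue("F");

        // Serialize with the identity transformer (not covered by the slides)
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildReport());
    }
}
```

The printed document contains a weatherReport root with a single high element whose scale attribute reads F, showing that the change made through setValue is reflected in the tree.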
