XML Prepared By Srinivasan Jayakumar
Briefly: The Power of XML
- XML is the Extensible Markup Language
- A text-based representation for describing data structure
- Both human- and machine-readable
- Derived from the Standard Generalized Markup Language (SGML)
- Became a World Wide Web Consortium (W3C) standard in 1998
- XML is a good choice for exchanging data between disparate systems
Synergy between Java and XML
- Java + XML = portable language + portable data
- Java can be used to:
  - generate XML data
  - access SQL databases
  - format data in XML
  - parse data
  - validate data
  - transform data
HTML and XML
- HTML and XML look similar because both derive from SGML; both
  - use elements enclosed in tags (e.g. <body>This is an element</body>)
  - use tag attributes (e.g. <font face="Verdana" size="+1" color="red">)
- More precisely:
  - HTML is defined in SGML
  - XML is a (very small) subset of SGML
HTML and XML
- HTML is for humans
  - HTML describes web pages
  - Browsers ignore and/or correct many HTML errors, so HTML is often sloppy
- XML is for computers
  - XML describes data
  - The rules are strict and errors are not allowed
  - In this way, XML is like a programming language
- Current versions of most browsers can display XML
Example XML document
    <?xml version="1.0"?>
    <weatherReport>
      <date>7/14/97</date>
      <city>North Place</city>, <state>NX</state>
      <country>USA</country>
      High Temp: <high scale="F">103</high>
      Low Temp: <low scale="F">70</low>
      Morning: <morning>Partly cloudy, Hazy</morning>
      Afternoon: <afternoon>Sunny &amp; hot</afternoon>
      Evening: <evening>Clear and Cooler</evening>
    </weatherReport>
Overall structure
- An XML document may start with one or more processing instructions or directives:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/css" href="ss.css"?>
- Following the directives, there must be exactly one root element containing all the rest of the XML:
    <weatherReport>
      ...
    </weatherReport>
XML building blocks
- Aside from the directives, an XML document is built from:
  - elements: high in <high scale="F">103</high>
  - tags, in pairs: <high scale="F"> and </high>
  - attributes: scale="F" in <high scale="F">103</high>
  - entities: &amp; in <afternoon>Sunny &amp; hot</afternoon>
  - data: 103 in <high scale="F">103</high>
Elements and attributes
- Attributes and elements are often interchangeable
- Example: the same name written with child elements and with attributes
    <name>
      <first>David</first>
      <last>Smith</last>
    </name>

    <name first="David" last="Smith"></name>
- Elements are easier to use from Java
- Attributes may contain elaborate metadata, such as unique IDs
Well-formed XML
- In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name>
  - Empty elements can be abbreviated: <break />
- XML tags are case sensitive and may not begin with the letters xml, in any combination of cases
- Elements must be properly nested, e.g. not <b><i>bold and italic</b></i>
- An XML document must have one and only one root element
- The values of attributes must be enclosed in quotes, e.g. <time unit="days">
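The rules above can be checked mechanically by handing text to any XML parser. The following is a minimal sketch, assuming the JAXP DocumentBuilder that ships with the JDK; the class name WellFormedCheck and the helper isWellFormed are illustrative names, not part of the original slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {          // hypothetical class name
    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;               // a well-formedness rule was violated
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<name><first>David</first></name>")); // true
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>"));     // false: bad nesting
        System.out.println(isWellFormed("<a/><b/>"));                          // false: two roots
    }
}
```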
XML as a tree
- An XML document represents a hierarchy
- A hierarchy is a tree
    novel
      foreword
        paragraph: This is the great American novel.
      chapter number="1"
        paragraph: It was a dark and stormy night.
        paragraph: Suddenly, a shot rang out!
Viewing XML
- XML is designed to be processed by computer programs, not to be displayed to humans
- Nevertheless, almost all current web browsers can display XML documents
  - They do not all display it the same way
  - They may not display it at all if it has errors
- This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used
XML Parsers
Stream Model
- The stream seen by the parser is a sequence of elements
- As each XML element is seen, an event occurs
  - Some code registered with the parser (the event handler) is executed
- This approach was popularized by the Simple API for XML (SAX)
- Problems:
  - Hard to get a global view of the document
  - Parsing state is represented by global variables set by the event handlers
Data Model
- The XML data is transformed into a navigable data structure in memory
  - Because XML elements nest, a tree data structure is used
  - The tree is navigated to explore the XML document
- This approach was popularized by the Document Object Model (DOM)
- Problems:
  - May require large amounts of memory
  - May not be as fast as the stream approach
    - Some DOM parsers use SAX to build the tree
SAX and DOM
- SAX and DOM are standards for XML parsers
  - DOM is a W3C standard
  - SAX is an ad hoc (but very popular) standard
- Various implementations are available
- Java implementations are provided as part of JAXP (Java API for XML Processing)
- JAXP is included in the JDK starting with JDK 1.4
  - It is available separately for Java 1.3
Difference between SAX and DOM
- DOM reads the entire document into memory and stores it as a tree data structure
- SAX reads the document and calls handler methods for each element or block of text that it encounters
- Consequences:
  - DOM provides "random access" into the document
  - SAX provides only sequential access to the document
  - DOM is slower and needs memory that grows with document size, so it is impractical for very large documents
  - SAX is fast and requires very little memory, so it can be used for huge documents
    - This makes SAX a popular choice for web sites
SAX Parsing
Parsing with SAX
- SAX uses the source-listener-delegate model for parsing XML documents
  - The source is XML data consisting of XML elements
  - A listener written in Java is attached to the document and listens for events
  - When an event is fired, handling is delegated to a method of the listener
SAX Parsing: process XML as Stream
Simple SAX program
- The program consists of two classes:
  - Sample: this class contains the main method; it
    - gets a factory to make parsers
    - gets a parser from the factory
    - creates a Handler object to handle callbacks from the parser
    - tells the parser which handler to send its callbacks to
    - reads and parses the input XML file
  - Handler: this class contains handlers for three kinds of callbacks:
    - startElement callbacks, generated when a start tag is seen
    - endElement callbacks, generated when an end tag is seen
    - characters callbacks, generated for the contents of an element
The Sample class
    import javax.xml.parsers.*;      // for both SAX and DOM
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    // For simplicity, we let exceptions propagate out of main.
    // In "real life" this is poor programming practice.
    public class Sample {
        public static void main(String args[]) throws Exception {
            // Create a parser factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Tell the factory that the parser must understand namespaces
            factory.setNamespaceAware(true);
            // Make the parser
            SAXParser saxParser = factory.newSAXParser();
            XMLReader parser = saxParser.getXMLReader();
            // Create a handler
            Handler handler = new Handler();
            // Tell the parser to use this handler
            parser.setContentHandler(handler);
            // Finally, read and parse the document
            parser.parse("hello.xml");
        }
    } // end of Sample class
- The parser reads the file hello.xml
- A relative name such as this is resolved against the current working directory, so run the program from the directory that contains the file (or pass a full path or URL)
The Handler class
- DefaultHandler is an adapter class that defines empty methods to be overridden
- We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags
- The methods will just print a line
- Each of these 3 methods is declared to throw a SAXException
    import org.xml.sax.*;
    import org.xml.sax.helpers.DefaultHandler;

    public class Handler extends DefaultHandler {
        // SAX calls this method when it encounters a start tag
        public void startElement(String namespaceURI, String localName,
                                 String qualifiedName, Attributes attributes)
                throws SAXException {
            System.out.println("startElement: " + qualifiedName);
        }

        // SAX calls this method to pass in character data
        public void characters(char ch[], int start, int length)
                throws SAXException {
            System.out.println("characters: \"" +
                               new String(ch, start, length) + "\"");
        }

        // SAX calls this method when it encounters an end tag
        public void endElement(String namespaceURI, String localName,
                               String qualifiedName)
                throws SAXException {
            System.out.println("endElement: /" + qualifiedName);
        }
    } // End of Handler class
Results
- If the file hello.xml contains:
    <?xml version="1.0"?>
    <display>Hello World!</display>
- Then the output from running java Sample will be:
    startElement: display
    characters: "Hello World!"
    endElement: /display
More results
- Now suppose the file hello.xml contains:
    <?xml version="1.0"?>
    <display>
      <i>Hello</i> World!
    </display>
- Notice that the root element, <display>, contains a nested element <i> and whitespace (including newlines)
- A typical result is shown below; how the parser splits text across characters callbacks is implementation-dependent, so the whitespace lines may differ
    startElement: display
    characters: ""            // empty string
    characters: "\n"          // newline (printed as a line break)
    characters: "  "          // spaces
    startElement: i
    characters: "Hello"
    endElement: /i
    characters: " World!"
    characters: "\n"          // another newline
    endElement: /display
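Because a parser may deliver one run of text in several characters callbacks, handlers commonly buffer the text and act on it only when the next tag arrives. The following is a minimal sketch of that idea, not the author's code; the class name TextCollector and its collect helper are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollector extends DefaultHandler {   // hypothetical class name
    private final StringBuilder buffer = new StringBuilder();
    final List<String> chunks = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();   // text seen before this start tag is complete
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);   // may be called several times per text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();   // text seen before this end tag is complete
    }

    /** Records the buffered text if it is more than whitespace, then resets the buffer. */
    private void flush() {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            chunks.add(text);
        }
        buffer.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.chunks;
    }

    public static void main(String[] args) throws Exception {
        // prints [Hello, World!]
        System.out.println(collect("<display>\n  <i>Hello</i> World!\n</display>"));
    }
}
```

With this approach the result no longer depends on how the parser happens to split whitespace and text.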
Factories
- SAX uses a parser factory
- A factory is a design-pattern alternative to constructors
- Factories allow the programmer to:
  - decide whether or not to create a new object
  - decide what kind of object to create
    class TrustMe {
        private TrustMe() { }                              // private constructor
        // Factory method; the parameter stands in for "a test of some sort"
        public static TrustMe makeTrust(boolean condition) {
            if (condition) {
                return new TrustMe();
            }
            return null;                                   // decline to create an object
        }
    }
Parser factories
- To create a SAX parser factory, call the static method SAXParserFactory.newInstance()
  - It returns an object of type SAXParserFactory
  - It may throw a FactoryConfigurationError
- Then, the parser can be customized:
  - public void setNamespaceAware(boolean awareness)
    - Call this with true if you are using namespaces
    - The default (if you don't call this method) is false
  - public void setValidating(boolean validating)
    - Call this with true if you want to validate against a DTD
    - The default (if you don't call this method) is false
    - Validation will give an error if you do not have a DTD
Getting a parser
- Once a SAXParserFactory factory has been set up, parsers can be created with:
    SAXParser saxParser = factory.newSAXParser();
    XMLReader parser = saxParser.getXMLReader();
- Note: SAXParser is not thread-safe
  - If parsing will happen in multiple threads, create a separate SAXParser object for each thread
Declaring which handler to use
- Since the SAX parser will call the handlers, we need to supply these methods
- Binding the parser to a handler:
    Handler handler = new Handler();
    parser.setContentHandler(handler);
- These statements could be combined:
    parser.setContentHandler(new Handler());
- Finally, the parser is invoked on the file to parse:
    parser.parse("hello.xml");
- Everything else is done in the handler methods
SAX handlers
- A complete callback handler implements 4 interfaces:
  - interface ContentHandler
    - Handles basic parsing callbacks, e.g. element starts and ends
  - interface DTDHandler
    - Handles only notation and unparsed entity declarations
  - interface EntityResolver
    - Does customized handling for external entities
  - interface ErrorHandler
    - Must be implemented or parsing errors will be ignored!
- Implementing all these interfaces is a lot of work
  - It is easier to use an adapter class
Class DefaultHandler
- DefaultHandler is an adapter class from package org.xml.sax.helpers
- DefaultHandler implements ContentHandler, DTDHandler, EntityResolver, and ErrorHandler
- DefaultHandler provides empty methods for every method declared in each of the interfaces
- To use this class, extend it and override the methods that are important to the application
ContentHandler methods
- public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
  - This method is called at the beginning of elements
- When SAX calls startElement, it passes in a parameter of type Attributes
- The following methods look up attributes by name rather than by index:
  - public int getIndex(String qualifiedName)
  - public int getIndex(String uri, String localName)
  - public String getValue(String qualifiedName)
  - public String getValue(String uri, String localName)
ContentHandler methods
- public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
  - The parameters to endElement are the same as those to startElement, except that the Attributes parameter is omitted
- public void characters(char[] ch, int start, int length) throws SAXException
  - ch is an array of characters
  - Only length characters, starting from ch[start], are the contents of the element
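The Attributes lookups described for startElement can be sketched as follows. This is an illustration under assumptions: the class name AttributeReader and the read helper are hypothetical, and the input reuses the weather report elements from the earlier example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeReader extends DefaultHandler {   // hypothetical class name
    // Maps element name to the value of its "scale" attribute, in document order
    final Map<String, String> scales = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String scale = atts.getValue("scale");   // lookup by qualified name; null if absent
        if (scale != null) {
            scales.put(qName, scale);
        }
    }

    static Map<String, String> read(String xml) throws Exception {
        AttributeReader handler = new AttributeReader();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.scales;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<weatherReport><city>North Place</city>"
                   + "<high scale=\"F\">103</high><low scale=\"F\">70</low></weatherReport>";
        System.out.println(read(xml));   // prints {high=F, low=F}
    }
}
```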
Error Handling
- SAX error handling is unusual
- Most errors are ignored unless an error handler (org.xml.sax.ErrorHandler) is registered
  - Ignored errors can cause unexpected behavior
- The ErrorHandler interface declares:
    public void fatalError(SAXParseException exception) throws SAXException   // XML not well-formed
    public void error(SAXParseException exception) throws SAXException        // XML validation error
    public void warning(SAXParseException exception) throws SAXException      // minor problem
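To make the point about ignored errors concrete, here is a minimal sketch that validates a document against an inline DTD twice: once with a plain DefaultHandler, whose error method does nothing, and once with a handler that rethrows. The class names StrictErrors and StrictHandler and the sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictErrors {                       // hypothetical class name
    /** Turns validation errors into exceptions instead of ignoring them. */
    static class StrictHandler extends DefaultHandler {
        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e;                              // stop parsing on a validation error
        }
        @Override
        public void warning(SAXParseException e) {
            System.err.println("warning: " + e.getMessage());
        }
    }

    /** Returns true if a validating parse finishes without an exception. */
    static boolean parses(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);              // validate against the DTD
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The DTD allows only text inside <display>, so the <i> element is invalid
        String xml = "<?xml version=\"1.0\"?>\n"
                   + "<!DOCTYPE display [<!ELEMENT display (#PCDATA)>]>\n"
                   + "<display><i>Hello</i></display>";
        System.out.println(parses(xml, new DefaultHandler()));   // validation error is ignored
        System.out.println(parses(xml, new StrictHandler()));    // validation error is reported
    }
}
```

The document is well-formed, so no fatalError occurs; only the error callback distinguishes the two runs.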
External parsers
- Alternatively, you can use an existing parser:
  - Xerces, Electric XML, Expat, MSXML, CMarkup
- Stages of the parsing:
  - Get the URL object for the source
  - Create an InputSource object encapsulating the data source
  - Create the parser
  - Launch the parser on the data source
Problems with SAX
- SAX provides only sequential access to the document being processed
- SAX has only a local view of the current element being processed
  - Global knowledge of parsing must be stored in global variables
  - There is a single startElement() method for all elements
    - In startElement() there are many "if-then-else" tests for checking a specific element
    - When an element is seen, a global flag is set
    - When finished with the element, the global flag must be set back to false
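The four stages listed above might look like this with the JDK's built-in JAXP parser. This is a sketch under assumptions: the class name ParseFromUrl and the countElements helper are hypothetical, and a temporary file stands in for whatever URL the data really comes from.

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParseFromUrl {                       // hypothetical class name
    /** Parses the XML at the given URL and returns the number of elements seen. */
    static int countElements(URL source) throws Exception {
        final int[] count = {0};
        // Stage 2: wrap the data source in an InputSource (here by system ID)
        InputSource input = new InputSource(source.toString());
        // Stage 3: create the parser
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        });
        // Stage 4: launch the parser on the data source
        parser.parse(input);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Stage 1: get a URL for the source (a temporary file in this sketch)
        Path file = Files.createTempFile("hello", ".xml");
        Files.write(file, "<display><i>Hello</i> World!</display>".getBytes(StandardCharsets.UTF_8));
        URL url = file.toUri().toURL();
        System.out.println(countElements(url));   // prints 2
        Files.delete(file);
    }
}
```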
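One common way to tame the flag bookkeeping described above is to keep a stack of open elements in the handler instead of separate global flags. This is a sketch of that alternative, not something the slides prescribe; the class name PathTracker and the trace helper are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathTracker extends DefaultHandler {   // hypothetical class name
    // Names of the elements that are currently open, outermost first
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);                          // entering an element
        paths.add("/" + String.join("/", open));      // record where we are
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();                            // leaving the element
    }

    static List<String> trace(String xml) throws Exception {
        PathTracker handler = new PathTracker();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }

    public static void main(String[] args) throws Exception {
        // prints [/novel, /novel/foreword, /novel/chapter, /novel/chapter/paragraph]
        System.out.println(trace(
            "<novel><foreword/><chapter><paragraph/></chapter></novel>"));
    }
}
```

The stack gives each callback the context it needs (the chain of enclosing elements) without one boolean per element type.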
DOM Parsing
DOM
- DOM represents the XML document as a tree
  - The hierarchical nature of a tree maps well to the hierarchical nesting of XML elements
  - The tree contains a global view of the document
    - Makes navigation of the document easy
    - Allows any subtree to be modified
    - Easier processing than SAX, but memory intensive!
- Like SAX, DOM is only an API
  - It does not specify a parser
  - It lists the API and the requirements for the parser
- DOM parsers typically use SAX parsing
DOM Parsing: process entire document
Simple DOM program
- First we need to create a DOM parser, called a DocumentBuilder
- The parser is created, not by a constructor, but by calling a static factory method
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
Simple DOM program
- An XML file hello.xml will be parsed:
    <?xml version="1.0"?>
    <display>Hello World!</display>
- To read this file, we add the following line:
    Document document = builder.parse("hello.xml");
- document contains the entire XML file as a tree
- The following code finds the content of the root element and prints it:
    Element root = document.getDocumentElement();
    Node textNode = root.getFirstChild();
    System.out.println(textNode.getNodeValue());
- The output of the program is: Hello World!
Reading in the tree
- The parse method reads in the entire XML document and represents it as a tree in memory
  - For a large document, parsing could take a while
  - If you want to interact with your program while it is parsing, run the parser in a separate thread
  - In practice, an XML parse tree may require up to ten times as much memory as the original XML document
- If you have a lot of tree manipulation to do, DOM is much more convenient than SAX
- If you do not have a lot of tree manipulation to do, consider using SAX instead
Structure of the DOM tree
- The DOM tree is composed of Node objects
- Node is an interface
  - Some of the more important sub-interfaces are Element, Attr, and Text
    - An Element node may have children
    - Attr and Text nodes are the leaves of the tree
- Hence, every object in the DOM tree is a Node, which can be downcast to a specific type if needed
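Put together, the DOM fragments above form a short complete program. The sketch below parses from a string rather than from hello.xml so that it runs without any files; the class name DomHello and the rootText helper are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomHello {                           // hypothetical class name
    /** Returns the value of the root element's first child, or null if it has none. */
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Parse from memory; builder.parse("hello.xml") would read a file instead
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        Element root = document.getDocumentElement();
        Node first = root.getFirstChild();
        return first == null ? null : first.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        // prints Hello World!
        System.out.println(rootText("<?xml version=\"1.0\"?><display>Hello World!</display>"));
    }
}
```

Note that getNodeValue() returns null when the first child is itself an element, so this shortcut only suits elements whose first child is text.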
Operations on Nodes
- The results returned by getNodeName(), getNodeValue(), getNodeType() and getAttributes() depend on the subtype of the node, as follows:

                      Element        Text            Attr
    getNodeName()     tag name       "#text"         name of attribute
    getNodeValue()    null           text contents   value of attribute
    getNodeType()     ELEMENT_NODE   TEXT_NODE       ATTRIBUTE_NODE
    getAttributes()   NamedNodeMap   null            null

Distinguishing Node types
- An easy way to handle different types of nodes:
    switch (node.getNodeType()) {
        case Node.ELEMENT_NODE:
            Element element = (Element) node;
            ...
            break;
        case Node.TEXT_NODE:
            Text text = (Text) node;
            ...
            break;
        case Node.ATTRIBUTE_NODE:
            Attr attr = (Attr) node;
            ...
            break;
        default:
            ...
    }
Operations on Nodes
- Tree-walking methods that return a Node:
  - getParentNode()
  - getFirstChild()
  - getNextSibling()
  - getPreviousSibling()
  - getLastChild()
- Test methods that return a boolean:
  - hasAttributes()
  - hasChildNodes()
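The tree-walking methods combine naturally with the node-type switch into a recursive traversal. The following is a minimal sketch, assuming a small document modeled on the novel example; the class name TreeWalker and the outline helper are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TreeWalker {                         // hypothetical class name
    /** Appends an indented outline of the subtree rooted at node. */
    static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                out.append(indent).append(node.getNodeName()).append('\n');
                break;
            case Node.TEXT_NODE: {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {            // skip whitespace-only text nodes
                    out.append(indent).append('"').append(text).append("\"\n");
                }
                break;
            }
            default:
                break;                            // other node types are ignored here
        }
        // Visit the children left to right using getFirstChild/getNextSibling
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline("<novel><foreword>Hi</foreword>"
            + "<chapter number=\"1\"><paragraph>Dark night.</paragraph></chapter></novel>"));
    }
}
```

Attributes such as number="1" do not appear in the outline because Attr nodes are reached through getAttributes(), not through the child links.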
Operations for Elements
- String getTagName()
  - Returns the name of the tag
- boolean hasAttribute(String name)
  - Returns true if this Element has the named attribute
- String getAttribute(String name)
  - Returns the value of the named attribute
- boolean hasAttributes()
  - Returns true if this Element has any attributes
- NamedNodeMap getAttributes()
  - Returns a NamedNodeMap of all the Element's attributes
Operations on Texts
- Text is a subinterface of CharacterData and inherits the following operations (among others):
  - public String getData() throws DOMException
    - Returns the text contents of this Text node
  - public int getLength()
    - Returns the number of characters (UTF-16 code units) in the text
  - public String substringData(int offset, int count) throws DOMException
    - Returns a substring of the text contents
Operations on Attrs
- String getName()
  - Returns the name of this attribute
- Element getOwnerElement()
  - Returns the Element node this attribute is attached to
- String getValue()
  - Returns the value of the attribute as a String
Overview
- DOM, unlike SAX, allows you to create and modify XML trees
- There are three basic kinds of operations:
  - Creating a new DOM
  - Modifying the structure of a DOM
  - Modifying the content of a DOM
- Creating a new DOM requires a few extra methods just to get started
  - Afterwards, you can add elements by modifying its structure and contents
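A short sketch of the Element operations listed above, using the <time unit="days"> element from the well-formedness slide as input. The class name ElementInfo and the helpers parseRoot and describe are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

public class ElementInfo {                        // hypothetical class name
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    /** Returns the tag name followed by each attribute as name=value. */
    static String describe(Element element) {
        StringBuilder sb = new StringBuilder(element.getTagName());
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);     // every item in this map is an Attr
            sb.append(' ').append(attr.getName()).append('=').append(attr.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Element time = parseRoot("<time unit=\"days\">5</time>");
        System.out.println(time.hasAttribute("unit"));            // true
        System.out.println(time.getAttribute("unit"));            // days
        System.out.println(time.getAttribute("zone").isEmpty());  // true: absent attributes give ""
        System.out.println(describe(time));                       // time unit=days
    }
}
```

Because getAttribute returns an empty string rather than null for a missing attribute, hasAttribute is the reliable way to tell "absent" from "present but empty".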
Creating a new DOM
    import javax.xml.parsers.*;
    import org.w3c.dom.Document;
    …
    try {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.newDocument();
    } catch (ParserConfigurationException e) { ... }
Creating structure
- The following are instance methods of Document:
  - public Element createElement(String tagName)
  - public Element createElementNS(String namespaceURI, String qualifiedName)
  - public Attr createAttribute(String name)
  - public Attr createAttributeNS(String namespaceURI, String qualifiedName)
  - public ProcessingInstruction createProcessingInstruction(String target, String data)
  - public EntityReference createEntityReference(String name)
  - public Text createTextNode(String data)
  - public Comment createComment(String data)
Methods of Node
- public Node appendChild(Node newChild)
- public Node insertBefore(Node newChild, Node refChild)
- public Node removeChild(Node oldChild)
- public Node replaceChild(Node newChild, Node oldChild)
- public void setNodeValue(String nodeValue)
  - Functionality depends on the type of the node
Methods of Element
- public void setAttribute(String name, String value)
- public Attr setAttributeNode(Attr newAttr)
- public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
- public Attr setAttributeNodeNS(Attr newAttr)
- public void removeAttribute(String name)
- public void removeAttributeNS(String namespaceURI, String localName)
- public Attr removeAttributeNode(Attr oldAttr)
Method of Attr
- public void setValue(String value)
  - This is the only method that modifies an Attr
  - The rest just retrieve information
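The creation and modification methods can be combined to build a document from scratch. The sketch below rebuilds the <name> example and serializes it with a javax.xml.transform Transformer; the serialization step is an assumption added so the result can be printed, since the slides do not cover output. The class name BuildName is illustrative.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildName {                          // hypothetical class name
    /** Builds <name first="David"><last>Smith</last></name> in memory. */
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element name = doc.createElement("name");
        doc.appendChild(name);                    // the root element
        name.setAttribute("first", "David");      // content stored as an attribute
        Element last = doc.createElement("last");
        last.appendChild(doc.createTextNode("Smith"));
        name.appendChild(last);                   // content stored as a child element
        return doc;
    }

    /** Serializes the document to a string without the XML declaration. */
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(build()));
    }
}
```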
Queries  ?

Java XML Parsing

  • 1. XML Prepared By Srinivasan Jayakumar
  • 2. Briefly: The Power of XML XML is Extensible Markup Language Text-based representation for describing data structure Both human and machine readable Originated from Standardized Generalized Markup Language (SGML) Became a World Wide Web Consortium (W3C) standard in 1998 XML is a great choice for exchanging data between disparate systems
  • 3. Synergy between Java and XML Java+XML=Portable language+Portable Data Allows use Java to generate XML data Use Java to access SQL databases Use Java to format data in XML Use Java to parse data Use Java to validate data Use Java to transform data
  • 4. HTML and XML HTML and XML look similar, because they are both SGML languages use elements enclosed in tags (e.g. <body>This is an element</body> ) use tag attributes (e.g., <font face=&quot;Verdana&quot; size=&quot;+1&quot; color=&quot;red&quot;> ) More precisely, HTML is defined in SGML XML is a (very small) subset of SGML
  • 5. HTML and XML HTML is for humans HTML describes web pages Browsers ignore and/or correct many HTML errors, so HTML is often sloppy XML is for computers XML describes data The rules are strict and errors are not allowed In this way, XML is like a programming language Current versions of most browsers display XML
  • 6. Example XML document <?xml version=&quot;1.0&quot;?> <weatherReport> <date>7/14/97</date> <city>North Place</city>, <state>NX</state> <country>USA</country> High Temp: <high scale=&quot;F&quot;>103</high> Low Temp: <low scale=&quot;F&quot;>70</low> Morning: <morning>Partly cloudy, Hazy</morning> Afternoon: <afternoon>Sunny & amp; hot</afternoon> Evening: <evening>Clear and Cooler</evening> </weatherReport>
  • 7. Overall structure An XML document may start with one or more processing instructions or directives: <?xml version=&quot;1.0&quot;?> <?xml-stylesheet type=&quot;text/css&quot; href=&quot;ss.css&quot;?> Following the directives, there must be exactly one root element containing all the rest of the XML: <weatherReport> ... </weatherReport>
  • 8. XML building blocks Aside from the directives, an XML document is built from: elements: high in < high scale=&quot;F&quot;>103</ high > tags, in pairs: <high scale=&quot;F&quot;> 103 </high> attributes: <high scale=&quot;F&quot; >103</high> entities: <afternoon>Sunny & amp; hot</afternoon> data: <high scale=&quot;F&quot;> 103 </high>
  • 9. Elements and attributes Attributes and elements are interchangeable Example: Elements are easier to use from Java Attributes may contain elaborate metadata, such as unique IDs <name> <first>David</first> <last>Smith</last> </name> <name first=&quot;David&quot; last= &quot; Smith&quot;> </name>
  • 10. Well-formed XML In XML, every element must have both a start tag and an end tag, e.g. <name> ... </name> Empty elements can be abbreviated: <break /> . XML tags are case sensitive and may not begin with the letters xml , in any combination of cases Elements must be properly nested e.g. not <b><i>bold and italic</b></i> XML document must have one and only one root element The values of attributes must be enclosed in quotes e.g. <time unit=&quot;days&quot;>
  • 11. XML as a tree An XML document represents a hierarchy A hierarchy is a tree novel foreword chapter number=&quot;1&quot; paragraph paragraph paragraph This is the great American novel. It was a dark and stormy night. Suddenly, a shot rang out!
  • 12. Viewing XML XML is designed to be processed by computer programs, not to be displayed to humans Nevertheless, almost all current Web browsers can display XML documents They do not all display it the same way They may not display it at all if it has errors This is just an added value. Remember: HTML is designed to be viewed, XML is designed to be used
  • 14. Stream Model Stream seen by parser is a sequence of elements As each XML element is seen, an event occurs Some code registered with the parser (the event handler) is executed This approach is popularized by the Simple API for XML (SAX) Problem: Hard to get a global view of the document Parsing state represented by global variables set by the event handlers
  • 15. Data Model The XML data is transformed into a navigable data structure in memory Because of the nesting of XML elements, a tree data structure is used The tree is navigated to discover the XML document This approach is popularized by the Document Object Model (DOM) Problem: May require large amounts of memory May not be as fast as stream approach Some DOM parsers use SAX to build the tree
  • 16. SAX and DOM SAX and DOM are standards for XML parsers DOM is a W3C standard SAX is an ad-hoc (but very popular) standard There are various implementations available Java implementations are provided as part of JAXP ( Java API for XML Processing ) JAXP package is included in JDK starting from JDK 1.4 Is available separately for Java 1.3
  • 17. Difference between SAX and DOM DOM reads the entire document into memory and stores it as a tree data structure SAX reads the document and calls handler methods for each element or block of text that it encounters Consequences: DOM provides &quot;random access&quot; into the document SAX provides only sequential access to the document DOM is slow and requires huge amount of memory, so it cannot be used for large documents SAX is fast and requires very little memory, so it can be used for huge documents This makes SAX much more popular for web sites
  • 19. Parsing with SAX SAX uses the source-listener-delegate model for parsing XML documents Source is XML data consisting of a XML elements A listener written in Java is attached to the document which listens for an event When event is thrown, some method is delegated for handling the code
  • 20. SAX Parsing: process XML as Stream
  • 21. Simple SAX program The program consists of two classes: Sample -- This class contains the main method; it Gets a factory to make parsers Gets a parser from the factory Creates a Handler object to handle callbacks from the parser Tells the parser which handler to send its callbacks to Reads and parses the input XML file Handler -- This class contains handlers for three kinds of callbacks: startElement callbacks, generated when a start tag is seen endElement callbacks, generated when an end tag is seen characters callbacks, generated for the contents of an element
  • 22. The Sample class import javax.xml.parsers.*; // for both SAX and DOM import org.xml.sax.*; import org.xml.sax.helpers.*; // For simplicity, we let the operating system handle exceptions // In &quot;real life&quot; this is poor programming practice public class Sample { public static void main(String args[]) throws Exception { // Create a parser factory SAXParserFactory factory = SAXParserFactory.newInstance(); // Tell factory that the parser must understand namespaces factory.setNamespaceAware(true); // Make the parser SAXParser saxParser = factory.newSAXParser(); XMLReader parser = saxParser.getXMLReader();
  • 23. The Sample class // Create a handler Handler handler = new Handler(); // Tell the parser to use this handler parser.setContentHandler(handler); // Finally, read and parse the document parser.parse(&quot;hello.xml&quot;); } // end of Sample class The parser reads the file hello.xml It should be located In the same directory In a directory that is included in the classpath
  • 24. The Handler class public class Handler extends DefaultHandler { DefaultHandler is an adapter class that defines empty methods to be overridden We define 3 methods to handle (1) start tags, (2) contents, and (3) end tags. The methods will just print a line Each of these 3 methods throws a SAXException // SAX calls this when it encounters a start tag public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes attributes) throws SAXException { System.out.println(&quot;startElement: &quot; + qualifiedName); }
  • 25. The Handler class // SAX calls this method to pass in character data public void characters(char ch[ ], int start, int length) throws SAXException { System.out.println("characters: \"" + new String(ch, start, length) + "\""); } // SAX calls this method when it encounters an end tag public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException { System.out.println("Element: /" + qualifiedName); } } // End of Handler class
  • 26. Results If the file hello.xml contains: <?xml version="1.0"?> <display>Hello World!</display> Then the output from running java Sample will be: startElement: display characters: "Hello World!" Element: /display
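The Sample and Handler classes above can be tried without a hello.xml file. The following is a minimal sketch, assuming the XML is supplied as a string; the class name EventRecorder and the method record are hypothetical, and the handler stores its lines in a list instead of printing them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class: records SAX callbacks in order instead of printing them.
class EventRecorder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice ch[start .. start+length) belongs to this callback.
        events.add("characters: \"" + new String(ch, start, length) + "\"");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("Element: /" + qName);
    }

    // Parses an XML string (an assumption; the slides parse a file) and returns the events.
    static List<String> record(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        EventRecorder handler = new EventRecorder();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        record("<display>Hello World!</display>").forEach(System.out::println);
    }
}
```

A parser is allowed to deliver the text of one element in several characters callbacks, so the number of recorded lines can vary between parsers.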
  • 27. More results Now suppose the file hello.xml contains: <?xml version="1.0"?> <display> <i>Hello</i> World! </display> Notice that the root element, <display>, contains a nested element <i> and whitespace (including newlines) The result will be: startElement: display characters: "" // empty string characters: " " // newline characters: " " // spaces startElement: i characters: "Hello" Element: /i characters: "World!" characters: " " // another newline Element: /display
  • 28. Factories SAX uses a parser factory The factory pattern is an alternative to calling constructors directly Factories allow the programmer to: Decide whether or not to create a new object Decide what kind of object to create class TrustMe { private TrustMe() { } // private constructor public static TrustMe makeTrust() { // factory method if ( /* test of some sort */ ) return new TrustMe(); return null; // no object created } }
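A compilable variant of the factory idea might look like the sketch below. The class name LimitedResource and the limit of two instances are assumptions chosen only to give the "test of some sort" a concrete form.

```java
// Hypothetical example: a factory method that decides whether to create an object.
final class LimitedResource {
    private static final int MAX_INSTANCES = 2; // assumed limit, for illustration only
    private static int created = 0;

    private LimitedResource() { } // private: callers must go through the factory method

    // Returns a new instance, or null once the limit has been reached.
    static synchronized LimitedResource create() {
        if (created >= MAX_INSTANCES) {
            return null;
        }
        created++;
        return new LimitedResource();
    }
}
```

Because the constructor is private, the factory method is the only way to obtain an instance, which is what lets it refuse to create one.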
  • 29. Parser factories To create a SAX parser factory, call static method: SAXParserFactory.newInstance() Returns an object of type SAXParserFactory It may throw a FactoryConfigurationError Then, the parser can be customized: public void setNamespaceAware(boolean awareness) Call this with true if you are using namespaces The default (if you don’t call this method) is false public void setValidating(boolean validating) Call this with true if you want to validate against a DTD The default (if you don’t call this method) is false Validation will give an error if you do not have a DTD
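The effect of setNamespaceAware can be observed in the arguments passed to startElement. This is a small sketch, assuming a made-up namespace URI (http://example.com/weather) and a hypothetical helper named describeRoot.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceDemo {
    // Returns "uri|localName|qName" as reported for the root element.
    static String describeRoot(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // customize before creating the parser
        XMLReader parser = factory.newSAXParser().getXMLReader();
        final String[] result = new String[1];
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (result[0] == null) { // keep only the first (root) element
                    result[0] = uri + "|" + localName + "|" + qName;
                }
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:report xmlns:w=\"http://example.com/weather\"/>";
        System.out.println(describeRoot(xml, true));
        System.out.println(describeRoot(xml, false));
    }
}
```

With namespace awareness switched on, the URI and local name are filled in; with it off, only the qualified name can be relied on.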
  • 30. Getting a parser Once a SAXParserFactory factory has been set up, parsers can be created with: SAXParser saxParser = factory.newSAXParser(); XMLReader parser = saxParser.getXMLReader(); Note: SAXParser is not thread-safe If parsing will be done in multiple threads, create a separate SAXParser object for each thread
  • 31. Declaring which handler to use Since the SAX parser will call the handlers, we need to supply these methods Binding the parser with a handler: Handler handler = new Handler(); parser.setContentHandler(handler); These statements could be combined: parser.setContentHandler(new Handler()); Finally, the parser is invoked on the file to parse: parser.parse("hello.xml"); Everything else is done in the handler methods
  • 32. SAX handlers SAX defines 4 callback interfaces that a handler can implement: interface ContentHandler Handles basic parsing callbacks, e.g., element starts and ends interface DTDHandler Handles only notation and unparsed entity declarations interface EntityResolver Does customized handling for external entities interface ErrorHandler Must be implemented and registered, or most parsing errors will be ignored! Implementing all these interfaces is a lot of work It is easier to use an adapter class
  • 33. Class DefaultHandler DefaultHandler is an adapter class from package org.xml.sax.helpers DefaultHandler implements ContentHandler , DTDHandler , EntityResolver , and ErrorHandler DefaultHandler provides default (mostly empty) implementations of every method declared in each of the interfaces To use this class, extend it and override the methods that are important to the application
  • 34. ContentHandler methods public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException This method is called at the beginning of elements When SAX calls startElement , it passes in a parameter of type Attributes The following methods look up attributes by name rather than by index: public int getIndex(String qualifiedName) public int getIndex(String uri, String localName) public String getValue(String qualifiedName) public String getValue(String uri, String localName)
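Looking up attributes by name inside startElement could be sketched as follows. The class name AttributeLookup, its fields, and the sample element are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers what the Attributes object reports for the root element.
class AttributeLookup extends DefaultHandler {
    String scale;           // value of the "scale" attribute, or null if absent
    int scaleIndex = -2;    // -2 means "startElement was never called"
    String missing = "unset";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        scale = atts.getValue("scale");      // look up by qualified name
        scaleIndex = atts.getIndex("scale"); // position in the list, or -1 if absent
        missing = atts.getValue("units");    // null: the element has no such attribute
    }

    static AttributeLookup read(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        AttributeLookup handler = new AttributeLookup();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AttributeLookup h = read("<high scale=\"F\">103</high>");
        System.out.println(h.scale + " at index " + h.scaleIndex + ", units=" + h.missing);
    }
}
```

The Attributes object is only guaranteed to be valid during the callback, so values that are needed later should be copied out, as the fields do here.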
  • 35. ContentHandler methods public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException The parameters to endElement are the same as those to startElement , except that the Attributes parameter is omitted public void characters(char[] ch, int start, int length) throws SAXException ch is an array of characters Only length characters, starting from ch[start] , are the contents of the element
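Because characters may be called several times for one element, a common approach is to buffer the text and use it in endElement. The sketch below assumes that approach; TextCollector and the "name=text" output format are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers character data and reports it when an element ends.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // discard text seen before this start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // append only the valid slice of the array
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim(); // drop surrounding whitespace and newlines
        if (!text.isEmpty()) {
            texts.add(qName + "=" + text);
        }
        buffer.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector handler = new TextCollector();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<display>\n  <i>Hello</i> World!\n</display>"));
    }
}
```

Resetting the buffer at each start tag is a simplification: text that precedes a nested element is dropped, which is acceptable here because it is only whitespace.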
  • 36. Error Handling SAX error handling is unusual Most errors are ignored unless an error handler ( org.xml.sax.ErrorHandler ) is registered Ignored errors can cause unexpected behavior The ErrorHandler interface declares: public void fatalError (SAXParseException exception) throws SAXException // XML not well-formed public void error (SAXParseException exception) throws SAXException // XML validation error public void warning (SAXParseException exception) throws SAXException // minor problem
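Registering an error handler might look like the following sketch. CollectingErrorHandler and check are hypothetical names, and the choice to rethrow from fatalError is an assumption about how an application would want to stop parsing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: records every reported problem with its severity.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        problems.add("error: " + e.getMessage()); // validation error; parsing continues
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        problems.add("fatal: " + e.getMessage()); // not well-formed; stop parsing
        throw e;
    }

    // Parses the XML string and returns the problems that were reported.
    static List<String> check(String xml, boolean validate) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        parser.setErrorHandler(handler);
        try {
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError; nothing more to do in this sketch.
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check("<a><b></a>", false));
        String dtd = "<!DOCTYPE display [<!ELEMENT display (#PCDATA)>]>";
        System.out.println(check(dtd + "<display><i/></display>", true));
    }
}
```

The first call reports a well-formedness problem through fatalError; the second uses an inline DTD so that validation has something to check against.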
  • 37. External parsers Alternatively, you can use an existing parser: Xerces, Electric XML, Expat, MSXML, CMarkup Stages of the parsing Get the URL object for the source Create InputSource object encapsulating the data source Create the parser Launch the parser on the data source
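The four stages listed above can be sketched with the JDK's own SAX parser standing in for an external one. The class UrlParseDemo, its methods, and the temporary file are assumptions that exist only to give the example a URL to read.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class UrlParseDemo {
    // Parses the document at the given URL and returns the root element's name.
    static String rootName(URL url) throws Exception {            // stage 1: the URL object
        final String[] root = new String[1];
        try (InputStream in = url.openStream()) {
            InputSource source = new InputSource(in);             // stage 2: wrap the data source
            source.setSystemId(url.toString());                   // base for relative references
            XMLReader parser = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();               // stage 3: create the parser
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
            parser.parse(source);                                 // stage 4: launch the parser
        }
        return root[0];
    }

    // Writes a small temporary file so the example has something to point a URL at.
    static URL sampleUrl() throws Exception {
        Path file = Files.createTempFile("hello", ".xml");
        Files.write(file, "<display>Hello World!</display>".getBytes(StandardCharsets.UTF_8));
        file.toFile().deleteOnExit();
        return file.toUri().toURL();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName(sampleUrl()));
    }
}
```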
  • 38. Problems with SAX SAX provides only sequential access to the document being processed SAX has only a local view of the current element being processed Global knowledge of parsing must be stored in global variables A single startElement() method for all elements In startElement() there are many “if-then-else” tests for checking a specific element When an element is seen, a global flag is set When finished with the element global flag must be set to false
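The flag-based bookkeeping described above might look like this sketch, which pulls the text of a city element out of a fragment of the weather report. CityExtractor and the field inCity are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: a flag remembers whether the parser is inside <city>.
class CityExtractor extends DefaultHandler {
    private boolean inCity = false;                       // state kept between callbacks
    private final StringBuilder city = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("city".equals(qName)) {                       // one method handles every element
            inCity = true;
            city.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCity) {
            city.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("city".equals(qName)) {
            inCity = false;                               // reset the flag when the element ends
        }
    }

    static String extract(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CityExtractor handler = new CityExtractor();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.city.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(
            "<weatherReport><city>North Place</city>, <state>NX</state></weatherReport>"));
    }
}
```

Each additional element of interest needs its own flag and its own branch in all three callbacks, which is the maintenance cost the slide refers to.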
  • 40. DOM DOM represents the XML document as a tree Hierarchical nature of tree maps well to hierarchical nesting of XML elements Tree contains a global view of the document Makes navigation of document easy Allows any subtree to be modified Easier processing than SAX but memory intensive! Like SAX, DOM is only an API Does not specify a parser Lists the API and requirements for the parser DOM parsers typically use SAX parsing
  • 41. DOM Parsing: process entire document
  • 42. Simple DOM program First we need to create a DOM parser, called a DocumentBuilder The parser is created, not by a constructor, but by calling a static factory method DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder builder = factory.newDocumentBuilder();
  • 43. Simple DOM program An XML file hello.xml will be parsed <?xml version="1.0"?> <display>Hello World!</display> To read this file, we add the following line: Document document = builder.parse("hello.xml"); document contains the entire XML file as a tree The following code finds the content of the root element and prints it Element root = document.getDocumentElement(); Node textNode = root.getFirstChild(); System.out.println(textNode.getNodeValue()); The output of the program is: Hello World!
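The DOM steps above can be put together into one small program. This sketch assumes the XML is passed as a string so that no hello.xml file is needed; DomHello and rootText are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomHello {
    // Returns the value of the root element's first child, or null if it has none.
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        Element root = document.getDocumentElement();
        Node textNode = root.getFirstChild();   // null for an empty element
        return textNode == null ? null : textNode.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootText("<display>Hello World!</display>"));
    }
}
```

The null check matters because an empty root element has no first child, and calling getNodeValue on it would fail.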
  • 44. Reading in the tree The parse method reads in the entire XML document and represents it as a tree in memory For a large document, parsing could take a while If you want to interact with your program while it is parsing, you need to run the parser in a separate thread In practice, an XML parse tree may require up to 10 times as much memory as the original XML document If you have a lot of tree manipulation to do, DOM is much more convenient than SAX If you do not have a lot of tree manipulation to do, consider using SAX instead
  • 45. Structure of the DOM tree The DOM tree is composed of Node objects Node is an interface Some of the more important sub-interfaces are Element , Attr , and Text An Element node may have children Attr and Text nodes are the leaves of the tree Hence, the DOM tree is composed of Node objects Node objects can be downcast into specific types if needed
  • 46. Operations on Node s The results returned by getNodeName() , getNodeValue() , getNodeType() and getAttributes() depend on the subtype of the node, as follows:
      Method           Element        Text           Attr
      getNodeName()    tag name       "#text"        name of attribute
      getNodeValue()   null           text contents  value of attribute
      getNodeType()    ELEMENT_NODE   TEXT_NODE      ATTRIBUTE_NODE
      getAttributes()  NamedNodeMap   null           null
  • 47. Distinguishing Node types An easy way to handle different types of nodes: switch(node.getNodeType()) { case Node.ELEMENT_NODE: Element element = (Element)node; ...; break; case Node.TEXT_NODE: Text text = (Text)node; ... break; case Node.ATTRIBUTE_NODE: Attr attr = (Attr)node; ... break; default: ... }
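The switch shown above can drive a recursive walk over the whole tree. The sketch below is one possible arrangement; TreeWalker, outline, and the indented output format are assumptions, and whitespace-only text nodes are skipped for readability.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class TreeWalker {
    // Appends one line per node to out, indenting by depth.
    static void walk(Node node, String indent, List<String> out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                Element element = (Element) node;
                out.add(indent + "element " + element.getTagName());
                NamedNodeMap attrs = element.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    walk(attrs.item(i), indent + "  ", out); // attributes are not children
                }
                for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
                    walk(c, indent + "  ", out);
                }
                break;
            }
            case Node.TEXT_NODE: {
                Text text = (Text) node;
                String data = text.getData().trim();
                if (!data.isEmpty()) {                        // skip whitespace-only text
                    out.add(indent + "text " + data);
                }
                break;
            }
            case Node.ATTRIBUTE_NODE: {
                Attr attr = (Attr) node;
                out.add(indent + "attr " + attr.getName() + "=" + attr.getValue());
                break;
            }
            default:
                break;                                        // comments etc. are ignored here
        }
    }

    static List<String> outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        walk(doc.getDocumentElement(), "", out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        outline("<display lang=\"en\"><i>Hello</i> World!</display>").forEach(System.out::println);
    }
}
```

Each case body is wrapped in braces so that its local variable stays scoped to that case.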
  • 48. Operations on Node s Tree-walking methods that return a Node : getParentNode() getFirstChild() getNextSibling() getPreviousSibling() getLastChild() Test methods that return a boolean : hasAttributes() hasChildNodes()
  • 49. Operations for Element s String getTagName() Returns the name of the tag boolean hasAttribute(String name) Returns true if this Element has the named attribute String getAttribute(String name) Returns the value of the named attribute boolean hasAttributes() Returns true if this Element has any attributes NamedNodeMap getAttributes() Returns a NamedNodeMap of all the Element’s attributes
  • 50. Operations on Text s Text is a subinterface of CharacterData and inherits the following operations (among others): public String getData() throws DOMException Returns the text contents of this Text node public int getLength() Returns the number of Unicode characters in the text public String substringData(int offset, int count) throws DOMException Returns a substring of the text contents
  • 51. Operations on Attribute s String getName() Returns the name of this attribute. Element getOwnerElement() Returns the Element node this attribute is attached to String getValue() Returns the value of the attribute as a String
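The Element, Text, and Attr operations from the last three slides can be exercised on a single small element. This is a sketch under the assumption that the XML comes from a string; NodeOperations and parseRoot are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class NodeOperations {
    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element high = parseRoot("<high scale=\"F\">103</high>");

        // Element operations
        System.out.println(high.getTagName());
        System.out.println(high.hasAttribute("scale"));
        System.out.println(high.getAttribute("scale"));
        System.out.println(high.getAttribute("unit").isEmpty()); // absent attribute gives ""

        // Text operations (inherited from CharacterData)
        Text text = (Text) high.getFirstChild();
        System.out.println(text.getData());
        System.out.println(text.getLength());
        System.out.println(text.substringData(1, 2));

        // Attr operations
        Attr scale = high.getAttributeNode("scale");
        System.out.println(scale.getName() + "=" + scale.getValue());
        System.out.println(scale.getOwnerElement().isSameNode(high));
    }
}
```

Note that getAttribute returns an empty string rather than null for a missing attribute, so hasAttribute is the reliable way to test for presence.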
  • 52. Overview DOM, unlike SAX, allows XML trees to be created and modified There are three basic kinds of operations: Creating a new DOM Modifying the structure of a DOM Modifying the content of a DOM Creating a new DOM requires a few extra methods just to get started Afterwards, you can add elements by modifying its structure and contents
  • 53. Creating a new DOM import javax.xml.parsers.*; import org.w3c.dom.Document; … try { DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder builder = factory.newDocumentBuilder(); Document doc = builder.newDocument(); } catch (ParserConfigurationException e) { ... }
  • 54. Creating structure The following are instance methods of Document : public Element createElement(String tagName) public Element createElementNS(String namespaceURI, String qualifiedName) public Attr createAttribute(String name) public Attr createAttributeNS(String namespaceURI, String qualifiedName) public ProcessingInstruction createProcessingInstruction (String target, String data) public EntityReference createEntityReference(String name) public Text createTextNode(String data) public Comment createComment(String data)
  • 55. Methods of Node public Node appendChild(Node newChild) public Node insertBefore(Node newChild, Node refChild) public Node removeChild(Node oldChild) public Node replaceChild(Node newChild, Node oldChild) public void setNodeValue(String nodeValue) Functionality depends on the type of the node
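Creating a document and then changing its structure could be sketched as below. DomBuilderDemo, build, and edit are hypothetical names, and the element names are chosen to echo the earlier hello.xml example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

class DomBuilderDemo {
    // Builds <display><i>Hello</i> World!</display> from scratch.
    static Document build() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("display");
        doc.appendChild(root);                       // the document gets its root element

        Text world = doc.createTextNode(" World!");
        root.appendChild(world);

        Element italic = doc.createElement("i");
        italic.appendChild(doc.createTextNode("Hello"));
        root.insertBefore(italic, world);            // place <i> ahead of the text node
        return doc;
    }

    // Replaces <i>Hello</i> with <b>Goodbye</b> and removes the trailing text.
    static void edit(Document doc) {
        Element root = doc.getDocumentElement();
        Node italic = root.getFirstChild();
        Element bold = doc.createElement("b");
        bold.appendChild(doc.createTextNode("Goodbye"));
        root.replaceChild(bold, italic);
        root.removeChild(root.getLastChild());
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        System.out.println(doc.getDocumentElement().getTextContent());
        edit(doc);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

New nodes are always created through the Document that will own them and only become part of the tree once they are appended or inserted.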
  • 56. Methods of Element public void setAttribute(String name, String value) public Attr setAttributeNode(Attr newAttr) public void setAttributeNS(String namespaceURI, String qualifiedName, String value) public Attr setAttributeNodeNS(Attr newAttr) public void removeAttribute(String name) public void removeAttributeNS(String namespaceURI, String localName) public Attr removeAttributeNode(Attr oldAttr)
  • 57. Method of Attribute public void setValue(String value) This is the only method that modifies an Attribute The rest just retrieve information
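The attribute-modifying methods of Element and Attr might be used as in this sketch. AttributeEditDemo and the attribute names and values are assumptions made for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AttributeEditDemo {
    // Builds a <high> element and then adds, changes, and removes attributes on it.
    static Element build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element high = doc.createElement("high");
        doc.appendChild(high);

        high.setAttribute("scale", "F");             // create by name and value

        Attr source = doc.createAttribute("source"); // create as a node, then attach
        source.setValue("station-7");
        high.setAttributeNode(source);

        high.setAttribute("scale", "C");             // setting again overwrites the value
        high.removeAttribute("source");              // detach by name
        return high;
    }

    public static void main(String[] args) throws Exception {
        Element high = build();
        System.out.println(high.getAttribute("scale"));
        System.out.println(high.hasAttribute("source"));
    }
}
```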