Thursday 4 June 2009

From XML to Java using SAX Parser

XML is a common format for storing data, especially data in transit between applications. For example, a program used to register students may need to load data produced by another program, written by a different company and possibly in a different language. One of the simplest ways to do this is for the source program to produce an XML file and for the destination program to load it.

This article discusses how the Java SAX parser can be used to load an XML file as a List of Java objects.

The problem

Assume we need to develop a program that reads the following simple XML file and loads it as a list of Java objects. The file is called Test.xml.


<students>
  <student name="Albert Attard">
    <class>Java</class>
    <class>Math</class>
    <grade>65</grade>
  </student>
  <student name="Mary Borg">
    <class>English</class>
    <grade>93</grade>
  </student>
  <student name="Joe Vella">
    <class>Math</class>
    <class>English</class>
    <grade>47</grade>
  </student>
  <student name="Paul Galea">
    <class>Math</class>
    <class>Maltese</class>
    <grade>52</grade>
  </student>
</students>

The above XML file contains a list of four students, each with a name, a list of classes that he/she will attend and a grade. The student name is an attribute of the student tag, while the class and grade are inner tags of the student tag.

Loading data from XML file

There are various ways to load data from XML files in Java, each with its pros and cons. Two common ones are DOM and SAX. What are these, and why are we using one and not the other? The main difference between the two is that the DOM parser loads the whole XML file into memory as a tree structure built from DOM-related classes. Once loaded, we can create our own structure from the DOM tree. The SAX parser, on the other hand, does not build a tree. It reads the file sequentially and triggers a series of events through which we can build our structure. Apart from the small piece it is currently reading, SAX keeps nothing in memory unless our handler chooses to store it.
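
For comparison, here is a minimal sketch of the DOM route applied to a simple question: how many students are there? The class name DomCountSketch and the inline XML string are assumptions made for illustration (they are not part of the examples in this article). The whole document is turned into a tree before the single query is asked.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, used only to illustrate the DOM approach.
class DomCountSketch {

  // Parses the whole document into a DOM tree, then queries the tree.
  static int countStudents(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document document = builder.parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      // Every student element is already in memory at this point.
      return document.getElementsByTagName("student").getLength();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse the XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<students>"
        + "<student name=\"Albert Attard\"><grade>65</grade></student>"
        + "<student name=\"Mary Borg\"><grade>93</grade></student>"
        + "</students>";
    System.out.println(countStudents(xml));
  }
}
```

This is fine for small files. The cost shows up with large documents, where the tree has to fit in memory even if only one number is needed from it.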

The SAX parser

Before proceeding, let's first understand the SAXParser and the elements it requires. The SAX parser needs a handler to handle the events that it triggers. The handler is a Java class that extends DefaultHandler and overrides some (or many) of its methods. Note that DefaultHandler implements a set of interfaces (ContentHandler, ErrorHandler, DTDHandler and EntityResolver) and provides a default implementation for all of their methods. Most of these defaults do nothing.

The SAX parser is created using the SAXParserFactory and the XML file is parsed using the provided handler (the default handler in this case) as illustrated in the following example:


import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class Example1 {
  public static void main(String[] args) throws Exception {
    DefaultHandler handler = new DefaultHandler();
    
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser p = factory.newSAXParser();
    p.parse(Example1.class.getResourceAsStream("Test.xml"), 
            handler);
  }
}

Note that I'm using the default package for all examples and the Test.xml file is in the same folder as the class files. If you move the classes into a package, make sure to also move the XML file with them or amend the file path accordingly.
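
If the file is not where the class expects it, getResourceAsStream returns null rather than throwing, and the failure only appears later inside the parser. The following is a minimal sketch of a more explicit check. The class name ResourceCheckSketch and its two helper methods are assumptions made for illustration.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, used only to illustrate the resource check.
class ResourceCheckSketch {

  // True when the named resource can be found relative to this class.
  static boolean resourceExists(String resourceName) {
    return ResourceCheckSketch.class.getResource(resourceName) != null;
  }

  // Parses the named resource and fails with a readable message if it is missing.
  static void parseResource(String resourceName, DefaultHandler handler)
      throws Exception {
    try (InputStream in =
        ResourceCheckSketch.class.getResourceAsStream(resourceName)) {
      if (in == null) {
        throw new FileNotFoundException(
            "Resource not found on the classpath: " + resourceName);
      }
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(in, handler);
    }
  }

  public static void main(String[] args) throws Exception {
    if (resourceExists("Test.xml")) {
      parseResource("Test.xml", new DefaultHandler());
      System.out.println("Parsed Test.xml");
    } else {
      System.out.println("Test.xml is not next to the class files");
    }
  }
}
```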

Executing the above example produces no output, as the default handler simply ignores all events triggered by the parser. The SAX parser reads through the file and, for every tag opened and closed, invokes specific methods on the handler, passing in the information from the XML file. Let's create a simple handler with some basic methods to help us understand this better. This handler will be used to load a list of students from the XML file defined above (Test.xml).


import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StudentHandler1 extends DefaultHandler {

  @Override
  public void characters(char[] ch, int start, int length)
                         throws SAXException {
  }

  @Override
  public void endElement(String uri, String localName, 
                         String name) throws SAXException {
  }

  @Override
  public void startDocument() throws SAXException {
  }

  @Override
  public void startElement(String uri, String localName, 
                           String name,Attributes attributes) 
                           throws SAXException {
  }
}

The above class overrides four methods from the default handler class.

The startDocument method is invoked once, when the SAX parser starts parsing the XML document. This method has no parameters and can be used to initialise fields (similar to a constructor or an initialiser block). The default handler also includes the endDocument method, which is invoked when the SAX parser is done parsing the file. This method can be used like a destructor (or a finally block) to clean up or wrap up any resources/fields as required.

The startElement method is invoked when the SAX parser encounters an opening XML tag, such as <students>. Similarly, the endElement method is invoked when the parser finds the closing tag (for example: </students>). In both methods our handler reads the tag name from the third parameter (name).

Finally, the characters method is invoked when the parser encounters the tag body, that is, the text (not XML) between the tags. For example, the characters for the grade XML tag <grade>85</grade> are 85. Note that whitespace is not removed or trimmed by the parser and must be handled by the handler.
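
To see the order in which these methods are called, the following sketch records every event as a line of text. The class name EventTraceSketch, the event labels and the trace helper are assumptions made for illustration. Text is collected and trimmed before it is recorded, so whitespace-only bodies are skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the events it receives, in order.
class EventTraceSketch extends DefaultHandler {

  private final List<String> events = new ArrayList<>();
  private final StringBuilder text = new StringBuilder();

  @Override
  public void startDocument() {
    events.add("startDocument");
  }

  @Override
  public void endDocument() {
    events.add("endDocument");
  }

  @Override
  public void startElement(String uri, String localName,
                           String name, Attributes attributes) {
    flushText();
    events.add("start:" + name);
  }

  @Override
  public void endElement(String uri, String localName, String name) {
    flushText();
    events.add("end:" + name);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  // Records the text collected so far, ignoring whitespace-only text.
  private void flushText() {
    String trimmed = text.toString().trim();
    if (!trimmed.isEmpty()) {
      events.add("text:" + trimmed);
    }
    text.setLength(0);
  }

  // Parses the given XML string and returns the recorded events.
  static List<String> trace(String xml) {
    try {
      EventTraceSketch handler = new EventTraceSketch();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.events;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse the XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<students><student name=\"Mary Borg\">"
        + "<grade>93</grade></student></students>";
    for (String event : trace(xml)) {
      System.out.println(event);
    }
  }
}
```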

Simple example

Let's start with a simple example that counts the number of students in the XML file (Test.xml). All we need to do here is create an integer field, initialise it to 0 in the startDocument method and increment it for every student tag. We removed the unnecessary methods from the previous handler (StudentHandler1) and added the endDocument method.


import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StudentHandler2 extends DefaultHandler {

  private int studentsCount;

  @Override
  public void endDocument() throws SAXException {
    System.out.println(studentsCount);
  }

  @Override
  public void startDocument() throws SAXException {
    studentsCount = 0;
  }

  @Override
  public void startElement(String uri, String localName, 
                           String name, Attributes attributes) 
                           throws SAXException {
    if ("student".equalsIgnoreCase(name)) {
      studentsCount++; // Increment statement
    }
  }
}

Note that we enclosed the increment statement within the if statement. The if statement checks the name of the XML tag. Our XML document has four different tags: students, student, class and grade, and we only want to increment our counter when the student tag is opened. Also, the if statement compares the tag name ignoring case. Strictly speaking, XML is case sensitive, so <Student> and <student> are two different tags. The case-insensitive comparison is a deliberate leniency that also accepts files whose tags are written in upper case. Use equals instead if the tag names must match exactly.
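
Tag names reach the handler exactly as they are written in the file, so the choice between equals and equalsIgnoreCase changes what gets counted. The following sketch counts the same document both ways. The class name CaseCountSketch and the mixed-case sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler comparing strict and lenient tag-name matching.
class CaseCountSketch extends DefaultHandler {

  private int exactMatches;
  private int ignoringCaseMatches;

  @Override
  public void startElement(String uri, String localName,
                           String name, Attributes attributes) {
    if ("student".equals(name)) {
      exactMatches++;
    }
    if ("student".equalsIgnoreCase(name)) {
      ignoringCaseMatches++;
    }
  }

  // Returns {exact matches, case-insensitive matches} for the given XML.
  static int[] count(String xml) {
    try {
      CaseCountSketch handler = new CaseCountSketch();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return new int[] {handler.exactMatches, handler.ignoringCaseMatches};
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse the XML", e);
    }
  }

  public static void main(String[] args) {
    int[] counts =
        count("<students><student/><Student/><STUDENT/></students>");
    System.out.println("exact: " + counts[0]);
    System.out.println("ignoring case: " + counts[1]);
  }
}
```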

Now, using the new handler (StudentHandler2), we can process our XML file (Test.xml) and see how many students we have.


import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class Example2 {
  public static void main(String[] args) throws Exception {
    StudentHandler2 handler = new StudentHandler2();

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser p = factory.newSAXParser();
    p.parse(Example2.class.getResourceAsStream("Test.xml"), 
            handler);
  }
}

The above should produce 4 when executed against the Test.xml file.

As you can see, we didn't load any data from the XML file into memory. Instead, we only performed the required operation (counting the number of students in this case). Using DOM here would be overkill, as all the students would have been loaded into memory for nothing when we only required a count. I'm not saying that DOM is not good. All I'm saying is that DOM is not the right tool for this job.

Some XML processing

Let's improve our handler and calculate the average grade for all students. To calculate the average, we first need to calculate the sum and then divide it by the number of students. This is not as simple as it sounds, because the SAX parser invokes the same handler methods regardless of the tag. For example, the startElement method is invoked for every opening XML tag, and the same applies to all the other handler methods. Thus we have to keep track of which tag we're handling.

Let's understand this problem first. We need to get the content of the grade XML tag, which can be retrieved from the characters method. But this method is executed for every tag (as we established above), so we first need to know which tag is being processed, and we learn that in the startElement method. Using a boolean field (called addToTotal in the following example), we set this field to true when startElement is invoked for the grade tag. Then, we only process characters while this field is true. We have to remember to set the field back to false once the grade tag is processed, which can be done in the endElement method. All this is captured in the following example.


import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StudentHandler3 extends DefaultHandler {

  private boolean addToTotal;
  private StringBuffer buffer;
  private int studentsCount;
  private int totalGrade;

  @Override
  public void characters(char[] ch, int start, int length)
      throws SAXException {
    if (addToTotal) {
      buffer.append(ch, start, length);
    }
  }

  @Override
  public void endDocument() throws SAXException {
    System.out.println(studentsCount);
    System.out.println(totalGrade * 1.0 / studentsCount);
  }

  @Override
  public void endElement(String uri, String localName, 
                         String name) throws SAXException {
    if ("grade".equalsIgnoreCase(name)) {
      addToTotal = false;
      totalGrade += Integer.parseInt(buffer.toString().trim());
    }
  }

  @Override
  public void startDocument() throws SAXException {
    studentsCount = 0;
    totalGrade = 0;
  }

  @Override
  public void startElement(String uri, String localName, 
                           String name, Attributes attributes) 
                           throws SAXException {
    if ("student".equalsIgnoreCase(name)) {
      studentsCount++;
    } else if ("grade".equalsIgnoreCase(name)) {
      addToTotal = true;
      buffer = new StringBuffer();
    }
  }
}

Why are we using a string buffer in the characters method? The XML tag body (where the grade value is) can be very long and spread across multiple lines. For example, we can have the following:


<grade>
65
</grade>

In this case, the characters method may be invoked more than once for the same tag body. For example, it may be called first with the new-line character (\n), then with the text 65, and finally with the other new-line character. The exact number of calls is up to the parser, which is free to deliver the text in one piece or in several. Thus we first need to accumulate all the content (in this case: \n65\n) and then remove all leading and trailing whitespace before parsing it into an integer. Note that our example will throw a NumberFormatException if the given grade is not an integer.
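
To check how a particular parser splits a tag body, the following sketch records each call to the characters method separately. The class name CharactersChunksSketch and its helper methods are assumptions made for illustration. The number of chunks printed may differ between parsers, but joining them always gives back the full tag body.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps every characters call as its own chunk.
class CharactersChunksSketch extends DefaultHandler {

  private final List<String> chunks = new ArrayList<>();

  @Override
  public void characters(char[] ch, int start, int length) {
    chunks.add(new String(ch, start, length));
  }

  // Returns the chunks in the order the parser delivered them.
  static List<String> chunksOf(String xml) {
    try {
      CharactersChunksSketch handler = new CharactersChunksSketch();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.chunks;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse the XML", e);
    }
  }

  // Joins all chunks and trims the result, as the grade handler does.
  static String joinedAndTrimmed(String xml) {
    return String.join("", chunksOf(xml)).trim();
  }

  public static void main(String[] args) {
    String xml = "<grade>\n65\n</grade>";
    System.out.println("calls to characters: " + chunksOf(xml).size());
    System.out.println("grade: " + Integer.parseInt(joinedAndTrimmed(xml)));
  }
}
```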

Handling XML attributes

The XML student tag also includes an attribute, the student name. We can get this value from the startElement method's Attributes parameter. This parameter holds all the attributes that belong to the tag being handled. The SAX parser takes care of populating it when parsing the XML document.


import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StudentHandler4 extends DefaultHandler {

  @Override
  public void startElement(String uri, String localName, 
                           String name, Attributes attributes) 
                           throws SAXException {
    if ("student".equalsIgnoreCase(name)) {
      String studentName = attributes.getValue("name");
      System.out.println(studentName);
    }
  }
}
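
When the attribute names are not known in advance, the Attributes parameter can also be read by index. The following sketch collects every attribute of every tag into a map. The class name AttributeDumpSketch, the "tag@attribute" key format and the year attribute in the sample are assumptions made for illustration (Test.xml only has the name attribute).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects all attributes it comes across.
class AttributeDumpSketch extends DefaultHandler {

  // Keys look like "student@name"; insertion order is preserved.
  private final Map<String, String> seen = new LinkedHashMap<>();

  @Override
  public void startElement(String uri, String localName,
                           String name, Attributes attributes) {
    for (int i = 0; i < attributes.getLength(); i++) {
      seen.put(name + "@" + attributes.getQName(i), attributes.getValue(i));
    }
  }

  // Parses the given XML string and returns the attributes found.
  static Map<String, String> attributesOf(String xml) {
    try {
      AttributeDumpSketch handler = new AttributeDumpSketch();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.seen;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse the XML", e);
    }
  }

  public static void main(String[] args) {
    // The year attribute is made up for this example.
    String xml = "<student name=\"Mary Borg\" year=\"2\"/>";
    attributesOf(xml).forEach(
        (key, value) -> System.out.println(key + " = " + value));
  }
}
```

Note that attributes.getValue("name") returns null when the tag has no such attribute, so a handler that relies on the name may want to check for null before using it.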

The above handler will list all the student names, one after another. We can enhance this handler and save all students and their grades in a list, as illustrated below. Before we proceed, we ideally create a Java class that represents the student. In this case the class only requires two fields, that is, the name and the grade. Note that in our problem we're not handling the student's class XML tag (we're ignoring it).


public class Student1 {

  private int grade;
  private String name;

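  // Getters and setters, as used by the handler that follows
  public int getGrade() { return grade; }
  public void setGrade(int grade) { this.grade = grade; }
  public String getName() { return name; }
  public void setName(String name) { this.name = name; }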

  @Override
  public String toString() {
    return name + " " + grade;
  }
}

Putting it all together

Let's now put everything together and build a list of students from the XML file. We need to change some of the fields and introduce new ones. For example, we need a list to put all the students in (listOfStudents) and a temporary variable to hold the current student until it is added to the list (student).


import java.util.List;
import java.util.Vector;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StudentHandler5 extends DefaultHandler {

  private StringBuffer buffer;
  private boolean isStudentGrade;
  private List<Student1> listOfStudents;
  private Student1 student;

  @Override
  public void characters(char[] ch, int start, int length)
      throws SAXException {
    if (isStudentGrade) {
      buffer.append(ch, start, length);
    }
  }

  @Override
  public void endDocument() throws SAXException {
    System.out.println(listOfStudents);
  }

  @Override
  public void endElement(String uri, String localName, 
                         String name) throws SAXException {
    if ("grade".equalsIgnoreCase(name)) {
      isStudentGrade = false;
      student.setGrade(
        Integer.parseInt(buffer.toString().trim()));
    } else if ("student".equalsIgnoreCase(name)) {
      listOfStudents.add(student);
      student = null;
    }
  }

  @Override
  public void startDocument() throws SAXException {
    listOfStudents = new Vector<Student1>();
  }

  @Override
  public void startElement(String uri, String localName, 
                           String name, Attributes attributes) 
                           throws SAXException {
    if ("student".equalsIgnoreCase(name)) {
      String studentName = attributes.getValue("name");
      student = new Student1();
      student.setName(studentName);
    } else if ("grade".equalsIgnoreCase(name)) {
      isStudentGrade = true;
      buffer = new StringBuffer();
    }
  }
}

Why is the student added to the list when the student closing tag is processed, when we could do this at the grade closing tag? The reason is that if in the future we add more fields, such as the student's classes, we can do so with minimal changes to the code.
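
As a rough sketch of such an extension, the handler below also collects the class tags of each student. The names StudentClassesSketch and StudentWithClasses are assumptions made for illustration, and the student type is kept as a small nested class with plain fields to keep the example short. Because the student is only added to the list at the student closing tag, supporting the class tag takes one extra branch in endElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that loads the classes as well as the grade.
class StudentClassesSketch extends DefaultHandler {

  // Hypothetical student type with plain fields, for brevity.
  static final class StudentWithClasses {
    String name;
    int grade;
    final List<String> classes = new ArrayList<>();

    @Override
    public String toString() {
      return name + " " + classes + " " + grade;
    }
  }

  private final List<StudentWithClasses> students = new ArrayList<>();
  private StringBuilder buffer; // non-null only inside class and grade tags
  private StudentWithClasses student;

  @Override
  public void startElement(String uri, String localName,
                           String name, Attributes attributes) {
    if ("student".equals(name)) {
      student = new StudentWithClasses();
      student.name = attributes.getValue("name");
    } else if ("class".equals(name) || "grade".equals(name)) {
      buffer = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (buffer != null) {
      buffer.append(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String name) {
    if ("class".equals(name)) {
      student.classes.add(buffer.toString().trim());
      buffer = null;
    } else if ("grade".equals(name)) {
      student.grade = Integer.parseInt(buffer.toString().trim());
      buffer = null;
    } else if ("student".equals(name)) {
      // The student is complete only now, whatever tags it contained.
      students.add(student);
      student = null;
    }
  }

  // Parses the given XML string and returns the students found.
  static List<StudentWithClasses> load(String xml) {
    try {
      StudentClassesSketch handler = new StudentClassesSketch();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.students;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse the XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<students><student name=\"Joe Vella\">"
        + "<class>Math</class><class>English</class>"
        + "<grade>47</grade></student></students>";
    System.out.println(load(xml));
  }
}
```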

Conclusion

In this article I covered how to use the SAX parser for simple processing of XML files. The SAX parser is ideal when we need to perform some processing, such as counting the number of students or calculating the average grade, without having to load the entire XML file into memory.

2 comments:

  1. you might also want to look at vtd-xml, the latest and most advanced XML processing API available today

    http://vtd-xml.sf.net

  2. Thanks for great article. I had problems with understanding saxparser as I have just started learning Java to make an Android app - and it has solved all my problems. Really big thanks!
