How To Parse Html In Java - Part 2

In the first article we saw how to parse HTML in Java using the following libraries:

  • Jsoup
  • HtmlCleaner
  • Internal Swing SDK parser

This time we will look at other useful libraries for parsing HTML:

  • TagSoup
  • HTML Parser
  • JTidy
  • NekoHtml

These examples aim to give you an idea of the kind of syntax and flexibility offered by these libraries.

TagSoup

TagSoup is a small, fast parser based on SAX, and it is particularly well suited to parsing HTML on the fly. One advantage of a SAX-based library is that it typically needs less memory and runs faster than a DOM-based one, because it reports elements as events while reading the input instead of building the whole document tree in memory.

The code below shows how to create a parser using just SAXParserImpl from TagSoup, overriding some methods of the SAX class DefaultHandler that ships with the JDK. Pretty simple, isn't it?

private static void parseWithTagSoup() throws Exception {
    System.out.println("*** TAGSOUP ***");
    // Feed the sample HTML to TagSoup's SAX parser and react to the events it fires.
    SAXParserImpl.newInstance(null).parse(
            new ByteArrayInputStream(HTML.getBytes("UTF-8")),
            new DefaultHandler() {
                // Called once for every opening tag.
                @Override
                public void startElement(String uri, String localName,
                                         String qName, org.xml.sax.Attributes attributes)
                        throws SAXException {
                    System.out.println("OPEN TAG: " + qName);
                }

                // Called for text content; one run of text may arrive in several chunks.
                @Override
                public void characters(char[] ch, int start, int length)
                        throws SAXException {
                    System.out.println(new String(ch, start, length));
                }
            }
    );
}
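
In practice a handler usually needs to remember where it is in the document, since SAX only delivers isolated events. The sketch below illustrates that idea by collecting the text of every element with a given name. It is only an illustration under some assumptions: the class SaxTextCollector and the method textsOf are hypothetical names, and the JDK's built-in SAX parser is used as a stand-in for TagSoup, so the input must be well-formed markup. The handler itself relies only on DefaultHandler, so it should be usable with TagSoup's parser as well.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextCollector {

    /** Gathers the text of each element named "wanted" (nested same-name elements are not handled). */
    static class TextHandler extends DefaultHandler {
        private final String wanted;
        private final List<String> texts = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a wanted element

        TextHandler(String wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equalsIgnoreCase(wanted)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equalsIgnoreCase(wanted) && current != null) {
                texts.add(current.toString());
                current = null;
            }
        }
    }

    /** Hypothetical helper: parses well-formed markup with the JDK SAX parser (stand-in for TagSoup). */
    public static List<String> textsOf(String markup, String elementName) {
        try {
            TextHandler handler = new TextHandler(elementName);
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Sample page</title></head>"
                + "<body><h1>Hello</h1></body></html>";
        System.out.println(textsOf(page, "title")); // prints [Sample page]
    }
}
```

Because the handler keeps only the text it is currently collecting, memory use stays small even for large pages, which is the main attraction of the SAX approach.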

HTML Parser

HTML Parser is another good library, focused on extraction (text and link extraction, link checking, screen scraping, resource extraction, etc.) and transformation (URL rewriting, site capture, censorship, HTML cleanup, conversion to XML, etc.).

The example below creates an instance of the Parser class, passing the sample HTML string to the constructor. Once the parser has been created and the document parsed, you can extract all nodes matching a given filter, for example:

extractAllNodesThatMatch(new TagNameFilter("h1"), true)

private static void parseWithHtmlParser() throws ParserException {
    System.out.println("*** HTMLPARSER ***");
    Parser parser = new Parser(HTML);
    // A null filter returns all the nodes of the document.
    NodeList nl = parser.parse(null);
    // The second argument (true) makes the search recursive.
    NodeList titles = nl.extractAllNodesThatMatch(new TagNameFilter("title"), true);
    printHtmlParserTagContents(titles);
    NodeList header1s = nl.extractAllNodesThatMatch(new TagNameFilter("h1"), true);
    printHtmlParserTagContents(header1s);
    NodeList trs = nl.extractAllNodesThatMatch(new TagNameFilter("tr"), true);
    printHtmlParserTagContents(trs);
    NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("td"), true);
    printHtmlParserTagContents(tds);
}

private static void printHtmlParserTagContents(NodeList nodes) {
    for (int i = 0; i < nodes.size(); i++) {
        final Node node = nodes.elementAt(i);
        // getChildren() can return null for a tag without children, so guard against it.
        final NodeList children = node.getChildren();
        System.out.println(node.getText() + ": " + (children == null ? "" : children.asString()));
    }
}

JTidy

JTidy is a Java port of HTML Tidy (an HTML syntax checker and pretty printer) that also provides a DOM interface to the processed document. This lets you use JTidy as a DOM parser, as shown in the example below. Note the use of some configuration parameters on the Tidy object, such as the input and output encodings, before the document is parsed.

private static void parseWithJTidy() throws Exception {
    System.out.println("*** JTIDY ***");
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);   // effectively disables line wrapping
    tidy.setPrintBodyOnly(true);          // print only the content of the body
    tidy.setXmlOut(true);                 // emit well-formed XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(HTML.getBytes("UTF-8"));
    // parseDOM writes the tidied markup to System.out and returns the DOM tree.
    org.w3c.dom.Document doc = tidy.parseDOM(inputStream, System.out);

    print(doc);
}

// Prints the title, the first h1 and every td; assumes each of them exists and starts with text.
private static void print(org.w3c.dom.Document doc) {
    final org.w3c.dom.Node title = doc.getElementsByTagName("title").item(0);
    System.out.println(title.getNodeName() + ": " + title.getFirstChild().getNodeValue());

    final org.w3c.dom.Node h1 = doc.getElementsByTagName("h1").item(0);
    System.out.println(h1.getNodeName() + ": " + h1.getFirstChild().getNodeValue());

    final org.w3c.dom.NodeList tds = doc.getElementsByTagName("td");
    for (int i = 0; i < tds.getLength(); i++) {
        final org.w3c.dom.Node td = tds.item(i);
        System.out.println(td.getNodeName() + ": " + td.getFirstChild().getNodeValue());
    }
}
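
The print() method expects every element it looks up to exist and to have a text node as its first child, which real pages do not always guarantee. As a possible variation, the sketch below reads element text defensively using only basic DOM calls. The names SafeDomText, parseWellFormed, firstText and collectText are hypothetical, and the Document is built with the JDK's DocumentBuilder from well-formed markup as a stand-in for the one returned by JTidy. It deliberately avoids getTextContent(), on the assumption that not every DOM implementation supports the DOM Level 3 methods.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SafeDomText {

    /** Hypothetical helper: builds a DOM from well-formed markup with the JDK parser. */
    public static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    /** Returns the text of the first element with the given name, or "" if there is none. */
    public static String firstText(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        if (nodes.getLength() == 0) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        collectText(nodes.item(0), out);
        return out.toString().trim();
    }

    /** Appends the value of every text or CDATA node below the given node, in document order. */
    private static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><head><title>Sample page</title></head>"
                + "<body><table><tr><td><b>bold</b> cell</td></tr></table></body></html>");
        System.out.println("title: " + firstText(doc, "title"));
        System.out.println("td: " + firstText(doc, "td"));
        System.out.println("h1: " + firstText(doc, "h1")); // no h1 in the sample, so empty text
    }
}
```

Walking the children this way also picks up text nested inside inline tags such as b, which getFirstChild().getNodeValue() would miss.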

NekoHtml

NekoHTML is an HTML scanner and tag balancer that very simply allows application programmers to parse HTML documents and access the information using standard XML interfaces. The parser is able to scan HTML files and correct many common mistakes that are made when HTML documents are written. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

The code below is similar to the JTidy one, and the print() method is exactly the same.

private static void parseWithNekoHtml() throws Exception {
    System.out.println("*** NEKOHTML ***");
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new StringReader(HTML)));

    final org.w3c.dom.Document doc = parser.getDocument();

    print(doc);
}
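
Since the result is a standard org.w3c.dom.Document, other standard XML tools from the JDK can be applied to it too. The sketch below shows one such option, querying a Document with XPath through javax.xml.xpath. The names XPathOverDom, parseWellFormed and select are hypothetical, and the Document is built with the JDK's DocumentBuilder from well-formed markup as a stand-in for the one returned by NekoHTML. With a real NekoHTML document the expressions may need adjusting, because element names may be reported in a different case depending on the parser's configuration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathOverDom {

    /** Hypothetical helper: builds a DOM from well-formed markup with the JDK parser. */
    public static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    /** Evaluates an XPath expression that selects text nodes and returns their values. */
    public static List<String> select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><table>"
                + "<tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr>"
                + "</table></body></html>");
        // Text of the second cell of every row.
        System.out.println(select(doc, "//tr/td[2]/text()")); // prints [b, d]
    }
}
```

Compared with chained getElementsByTagName() calls, an XPath expression can express conditions such as "the second cell of each row" in a single line.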
