Parsing HTML with jsoup

Although a lot of my friends and colleagues use jsoup, I never had a chance to use it. My brain defaults to not choosing Java as the language for parsing HTML.

There’s a lot of boilerplate to write, but with Kotlin, the process seems to be getting a little more fun!

Introduction

Quoted from the jsoup website:

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

What it can do
  • Scrape and parse HTML from a URL, file, or String
  • Find and extract data, using DOM traversal or CSS selectors
  • Manipulate the HTML elements, attributes and text
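To make the list above concrete, here is a minimal sketch that parses HTML from a String, extracts data with a CSS selector and with a DOM-style method, and then changes an attribute. The HTML snippet and the helper names (sampleHtml, postTitles, countLinks, addNoFollow) are made up for illustration; only the jsoup calls themselves come from the library.

```kotlin
import org.jsoup.Jsoup

// Hypothetical HTML snippet, made up for this example
val sampleHtml = """
    <div id="posts">
      <a class="post" href="/first">First post</a>
      <a class="post" href="/second">Second post</a>
    </div>
""".trimIndent()

// Parse from a String, then extract the link texts with a CSS selector
fun postTitles(html: String): List<String> =
    Jsoup.parse(html).select("a.post").map { it.text() }

// Same document, but using a DOM-style lookup instead of a selector
fun countLinks(html: String): Int =
    Jsoup.parse(html).getElementsByTag("a").size

// Manipulate: set rel="nofollow" on every matching link, then read it back
fun addNoFollow(html: String): List<String> {
    val links = Jsoup.parse(html).select("a.post")
    links.attr("rel", "nofollow") // Elements.attr(key, value) sets it on each element
    return links.map { it.attr("rel") }
}

fun main() {
    println(postTitles(sampleHtml))  // [First post, Second post]
    println(countLinks(sampleHtml))  // 2
    println(addNoFollow(sampleHtml)) // [nofollow, nofollow]
}
```

Nothing here touches the network, so it is an easy way to experiment before pointing jsoup at a real URL.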

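The jsoup version used in this post doesn’t support XPath. You can use xsoup for that. (Later jsoup releases, from 1.14.3 on, added XPath support via selectXpath.)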

Coding Time

Given the Hacker News front page, we want to extract each story link and format it like this:

1. A Super Story About Some Startup (https://example.com)
2. Some weird story about JS (https://js.org)

First things first, add the jsoup dependency to your build.gradle:

compile 'org.jsoup:jsoup:1.10.3'

Now, in a Kotlin file:

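import org.jsoup.Jsoup

fun main() {
    Jsoup.connect("https://news.ycombinator.com/").get().run {
        select("td a.storylink").forEachIndexed { index, element ->
            // index is zero-based, so add 1 to match the numbered output above
            println("${index + 1}. ${element.text()} (${element.attr("href")})")
        }
    }
}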
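If you want to try the same idea without hitting the network, the extraction can be pulled into a small function that takes an HTML String. This is only a sketch: formatStories is a hypothetical helper name, and the sample markup is invented to match what the "td a.storylink" selector expects. The live site's markup may have changed since this post was written, so treat the selector as an assumption. Indices from mapIndexed start at 0, so the sketch adds 1 to match the numbered format.

```kotlin
import org.jsoup.Jsoup

// Turn HTML into lines like "1. Title (url)", using the selector from the post
fun formatStories(html: String): List<String> =
    Jsoup.parse(html)
        .select("td a.storylink")
        .mapIndexed { index, element ->
            "${index + 1}. ${element.text()} (${element.attr("href")})"
        }

fun main() {
    // Hypothetical markup, loosely modelled on the structure the selector expects
    val html = """
        <table>
          <tr><td class="title"><a class="storylink" href="https://example.com">A Super Story About Some Startup</a></td></tr>
          <tr><td class="title"><a class="storylink" href="https://js.org">Some weird story about JS</a></td></tr>
        </table>
    """.trimIndent()

    formatStories(html).forEach(::println)
}
```

Keeping the parsing separate from the fetching also makes it easy to swap Jsoup.parse(html) for Jsoup.connect(url).get() later.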

Done! 🎉 I think there’s no reason left not to use jsoup for your HTML parsing needs 😬

You can find the code on my GitHub.

Bonus

For CSS selectors, please check this cheatsheet!
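As a quick taste of what selectors can do, here is a small sketch that runs a handful of common selector forms against a made-up document. The HTML, the page value, and the texts helper are all hypothetical, chosen only to exercise the selectors.

```kotlin
import org.jsoup.Jsoup

// Hypothetical HTML, made up to exercise a few selector forms
val page = Jsoup.parse("""
    <ul id="menu">
      <li class="item active"><a href="https://example.com/a">A</a></li>
      <li class="item"><a href="/b">B</a></li>
    </ul>
    <p>Outside <a href="https://example.com/c">C</a></p>
""".trimIndent())

// Small helper: run a selector and return the text of each matched element
fun texts(selector: String): List<String> = page.select(selector).map { it.text() }

fun main() {
    println(texts("a"))                 // by tag: [A, B, C]
    println(texts("li.active"))         // by class: [A]
    println(texts("#menu > li"))        // direct children of an id: [A, B]
    println(texts("a[href^=https]"))    // attribute value prefix: [A, C]
    println(texts("li:not(.active) a")) // negation plus descendant: [B]
}
```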

Published 9 Jul 2017

Esa Firman on Twitter