- Published on
Parsing HTML with jsoup
- Authors
- Name
- Esa Firman
- @esafirm
Although a lot of my friends and colleagues use jsoup
, i never had a chance to use it. It's my brain default to not choose Java as the language for parsing HTML.
There's a lot of boilerplate to do, but with Kotlin, it seems this process getting a little more fun
!
Introduction
Quoted from jsoup
website
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
What it can do
- Scrape and parse HTML from a URL, file, or
String
- Find and extract data, using DOM traversal or CSS selectors
- Manipulate the HTML elements, attributes and text
jsoup
doesn't support XPath. You can use xsoup for that.
Coding Time
Given Hacker News link, we want to extract each story link and format it like this
1. A Super Story About Some Startup (https://example.com)
2. Some weird story about JS (https://js.org)
First thing first, add jsoup
dependency to your build.gradle
compile 'org.jsoup:jsoup:1.10.3'
Now in Kotlin file
Jsoup.connect("https://news.ycombinator.com/").get().run {
select("td a.storylink").forEachIndexed { index, element ->
println("$index. ${element.text()} (${element.attr(\"href\")})")
}
}
Done! \ud83c\udf89 I think there's no more reason to not use jsoup
for your HTML parsing need \ud83d\ude2c
You can find the code on my Github
Bonus
For css selector, please check this cheatsheet!