Groovy: Stripping HTML With JSoup
As a follow-up to my post on sorting objects in Groovy: in the same script, I was piecing together comment threads, which is why I needed to sort things by date. The content of each comment was given to me in HTML, though, and the system I needed to import the data into only accepts plaintext. As a result, I needed a way to strip out all of the HTML tags, which frequently included embedded CSS, leaving me with just the plaintext content.
Before wasting too much time on what would undoubtedly be a non-trivial task, I figured there was no way I was the only person who had ever needed to do this. Since Groovy can use Java libraries directly, I was sure there had to be a Java library designed for exactly this. After a quick search online, it seemed that JSoup is by far the most popular library for this in the Java world. It’s available on Maven, meaning that I could very easily import it into my Groovy script with Grapes, the same way that I did with the msal4j library. A quick test with a short sample script showed me that this was exactly what I needed.
The two main aspects of that script are the import to get the library:
@Grapes(
    @Grab(group='org.jsoup', module='jsoup', version='1.13.1')
)
import org.jsoup.*
And the Jsoup call to parse the string:
String cleanedText = Jsoup.parse(sampleHTML).text()
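Put together, a minimal end-to-end sketch might look like the following. The contents of sampleHTML are made up for illustration (the real comment bodies came from API calls), and the snippet assumes the same JSoup version grabbed above.

```groovy
@Grapes(
    @Grab(group='org.jsoup', module='jsoup', version='1.13.1')
)
import org.jsoup.Jsoup

// Hypothetical comment body with an embedded stylesheet and inline styling.
String sampleHTML = '<html><head><style>p { color: red; }</style></head>' +
        '<body><p style="font-weight: bold;">Hello, <b>world</b>!</p>' +
        '<p>Second paragraph.</p></body></html>'

// parse() builds a document from the string; text() returns only its
// text content, with whitespace normalized.
String cleanedText = Jsoup.parse(sampleHTML).text()

println cleanedText
```

The tags, the inline style attribute, and the contents of the style block should all be gone from the printed result, leaving only the sentence text.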
It was really that simple. The JSoup library is actually quite fully-featured, allowing me to make an HTTP connection and pull the content directly from the web if needed. In my case, though, I already had the content in question via API calls, so I just had to tell JSoup to parse a string rather than making any network connections.
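For comparison, here is a rough sketch of the network route I didn't need. The fetchPlainText helper and the example.com URL are placeholders of my own, not part of the original script; Jsoup.connect(...).get() is the library's way to fetch and parse a page in one go.

```groovy
@Grapes(
    @Grab(group='org.jsoup', module='jsoup', version='1.13.1')
)
import org.jsoup.Jsoup

// Hypothetical helper: fetch a page over HTTP and return its plaintext.
String fetchPlainText(String url) {
    // connect() prepares the request; get() executes it and parses the response.
    Jsoup.connect(url).get().text()
}

// Placeholder URL; requires network access when run.
println fetchPlainText('https://example.com')
```

The only difference from the string-based approach is where the document comes from; the text() call at the end is the same.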
Originally published at https://looped.network on May 17, 2021.