By Alvin Alexander. Last updated: April 22, 2024
If you ever need to convert HTML to plain text using Scala or Java, I hope these Jsoup examples are helpful:
    import org.jsoup.Jsoup
    import org.jsoup.nodes.{Document, Element}

    object JsoupHtmlToPlainTextTest extends App {

        val html = """
            |<html>
            |  <head><title>Hello, world</title></head>
            |  <body>
            |    <h1>Hello, world</h1>
            |    <p>Hello, world.</p>
            |    <p>This is a test.</p>
            |  </body>
            |</html>
            """.stripMargin

        // Example 1: this works, but all output is on one line
        val doc: Document = Jsoup.parse(html)
        //val s: String = doc.text()     //include <head> and <body> text
        val s: String = doc.body.text()  //<body> text only
        //println(s)

        // Example 2: this works, output is on multiple lines
        // (JsoupFormatter is not defined in this snippet; it is a separate helper class)
        val formatter = new JsoupFormatter
        val plainText = formatter.getPlainText(doc)
        //println(plainText)

        // Example 3: this works as a way to select the <body> only
        val body: String = doc.select("body").first.text()
        //println(body)

        // Example 4: works: gets text from paragraphs only
        // https://jsoup.org/cookbook/input/parse-body-fragment
        val doc4 = Jsoup.parseBodyFragment(html)
        val body4 = doc4.body()
        val paragraphs = body4.getElementsByTag("p")
        import scala.collection.JavaConverters._
        val scalaParagraphs = asScalaBuffer(paragraphs)
        for (paragraph <- scalaParagraphs) {
            println(paragraph.text)
        }

    }
While this is just some test code that I’m currently working on to understand Jsoup, it shows four different ways to convert the given HTML into plain text: all of the `<body>` text on one line, multi-line text from a formatter class, the `<body>` selected with a CSS-style query, and the text of the `<p>` elements only. Hopefully the comments explain how each HTML-to-plain-text conversion works, so I won’t write more about them. I just wanted to share this code snippet here today a) so I can find it again, and b) in hopes it might help others who need to convert HTML to text using Jsoup.
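Because the formatter class used in Example 2 isn't shown, here is a minimal sketch of one way to get multi-line output using only Jsoup's `select` and `text` methods. It is not the formatter from the example above: the object name `JsoupBlockTextSketch`, the helper `blockText`, and the default selector are all hypothetical names chosen for illustration, and the `scala.jdk.CollectionConverters` import assumes Scala 2.13 or newer.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object JsoupBlockTextSketch {

  // Hypothetical helper (not part of Jsoup): returns the text of each
  // matched element inside <body>, one element per line, in document order.
  def blockText(
      html: String,
      selector: String = "h1, h2, h3, h4, h5, h6, p, li"
  ): String =
    Jsoup.parse(html)
      .body()
      .select(selector)   // Elements is a java.util.List[Element]
      .asScala            // view it as a Scala Buffer
      .map(_.text())      // text of each element, whitespace normalized
      .filter(_.nonEmpty) // skip elements with no text
      .mkString("\n")

  def main(args: Array[String]): Unit = {
    val html =
      """<html>
        |  <head><title>Hello, world</title></head>
        |  <body>
        |    <h1>Hello, world</h1>
        |    <p>Hello, world.</p>
        |    <p>This is a test.</p>
        |  </body>
        |</html>""".stripMargin

    // Prints three lines: the <h1> text, then each <p> text
    println(blockText(html))
  }
}
```

One caveat with this approach: if a matched element is nested inside another matched element (a `<p>` inside an `<li>`, for example), its text appears twice, so the selector may need adjusting for real-world pages.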