


You’ll need to add the jsoup dependency to your pom.xml file, in the <dependencies> section.

Then, after installing the library, let’s import it into our work environment, alongside other utilities we’ll use in this project.

Fetching the web page

For this jsoup tutorial, we’ll be seeking to extract the anchor texts and their associated links from this web page.

Here is the syntax for fetching the page:

Document page = Jsoup.connect("").get();

jsoup lets you fetch the HTML of the target page and build its corresponding DOM tree, which works just like a normal browser’s DOM. With the parsable document markup, it’ll be easy to extract and manipulate the page’s content.

This is what is happening in the code above:

The Jsoup class uses the connect method to make a connection to the page’s URL.
The get method represents the HTTP GET request made to retrieve the web page.
jsoup loads and parses the page’s HTML content into a Document object.

Furthermore, the Jsoup class, which is the root for accessing jsoup’s functionalities, allows you to chain different methods so that you can perform advanced web scraping or complete other tasks. For example, you can imitate a user agent and specify request parameters by chaining further calls between connect and get.

Selecting the page’s elements

After converting the HTML of the target page into a Document, we can now traverse it and get the information we are searching for. jsoup uses a CSS or jQuery-like selector syntax to allow you to find matching elements. With the select method, which is available in a Document, you can filter the elements you want.
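To make the goal of this tutorial concrete, here is a minimal sketch of pulling anchor texts and their links out of a Document with the select method. It is an illustration under stated assumptions, not the tutorial's original listing: the class name LinkExtractor, the helper extractLinks, the inline HTML string, and the example.com base URL are all hypothetical. The markup is parsed locally with Jsoup.parse, so the sketch runs without a network request.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    // Hypothetical helper: returns one "anchor text -> absolute URL" entry
    // for every <a> element that carries an href attribute.
    static List<String> extractLinks(Document page) {
        List<String> links = new ArrayList<>();
        // CSS-style selector: anchors with an href attribute
        Elements anchors = page.select("a[href]");
        for (Element anchor : anchors) {
            // absUrl resolves relative hrefs against the document's base URI
            links.add(anchor.text() + " -> " + anchor.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in markup (an assumption); a fetched page would be used the same way
        String html = "<p><a href=\"/docs\">Docs</a> and "
                + "<a href=\"https://example.org/\">Example</a></p>";
        Document page = Jsoup.parse(html, "https://example.com/");
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

A Document returned by a live fetch could be passed to extractLinks in the same way, since the selector only depends on the parsed markup.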

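The method chaining on the Jsoup class mentioned in the fetching step can be sketched as follows. This is only one possible shape: the class name ConnectionSketch, the helper buildRequest, the user agent string, the "q" parameter, and the example.com URL are assumptions made for illustration. The sketch builds the request without sending it, so nothing is fetched until get is called.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {

    // Hypothetical helper: configures a request but does not send it.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // assumed UA string
                .data("q", "jsoup")   // assumed request parameter
                .timeout(10_000);     // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection connection = buildRequest("https://example.com/search");
        // Inspect the configured request; no network traffic happens here
        System.out.println(connection.request().header("User-Agent"));
        // Calling connection.get() would send the GET request and return a Document.
    }
}
```

Keeping the configuration in one helper makes it easy to reuse the same user agent and timeout for every page the scraper visits.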