Web Scraping with Java Jsoup



In this blog, we'll learn how web scraping is done with Java by extracting images from a URL provided by the user. A version of this project can be found at Image Extractor By Aashish Katwal. If you like it, don't forget to give it a ⭐; any contribution is warmly welcomed. 😊

What is Web Scraping?

Web scraping is the process of extracting data from a website, on either a large or a small scale. We can obtain specific data such as images and tables, or the source code of an entire website. The extracted data can be used for purposes such as data harvesting and research, and it can be analyzed for insights as required.

How does Web Scraping work?

A web scraper can obtain all the data on a website or only the data we want. First, we provide the URL of the website we want to scrape. It is good to specify what type of data we want so that the process is quick and efficient.

For example, if we want images from a website (which is what we are going to scrape in this blog), we specify that we need only the elements with the img tag. The scraper then collects every img tag it finds on the page at the provided URL.

A web scraper loads the HTML code from the URL, though some advanced scrapers can also load CSS and JavaScript. The extracted data can be stored in an Excel, CSV, or JSON file, which can later be used for research, analysis, or other purposes.
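
As a rough illustration of the storage step, here is a minimal sketch that writes a list of scraped image URLs to a CSV file using only the Java standard library. The class name ScrapeResultCsv, the column names, and the example URLs are assumptions made for this sketch; they are not part of the project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class ScrapeResultCsv {

    // Wraps a field in double quotes and doubles any embedded quotes,
    // which is the usual CSV escaping convention.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Builds a header row followed by one row per image URL.
    static List<String> toCsvLines(String pageUrl, List<String> imageUrls) {
        List<String> lines = new ArrayList<>();
        lines.add("page,image");
        for (String imageUrl : imageUrls) {
            lines.add(quote(pageUrl) + "," + quote(imageUrl));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Example data only; a real run would pass the scraped list here.
        List<String> lines = toCsvLines("https://example.com",
                Arrays.asList("https://example.com/logo.png", "https://example.com/banner.jpg"));

        // Write to a temporary file so the sketch does not overwrite anything.
        Path out = Files.createTempFile("scraped-images", ".csv");
        Files.write(out, lines, StandardCharsets.UTF_8);
        System.out.println("Wrote " + lines.size() + " lines to " + out);
    }
}
```

Quoting every field keeps the file valid even when a URL contains a comma or a quote.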

Web Scraping with Java

In Java, there's a library called Jsoup, which is one of the most popular Java libraries for web scraping. I am building a Maven project that uses JSP. If you want to build a plain Java application or a normal web project instead, you can download the jar file from the Jsoup website and include it in your project.
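
For the Maven route, the dependency entry in pom.xml would look roughly like this. The group and artifact IDs are Jsoup's published coordinates; the version shown is only an assumption, so check the Jsoup website for the release you want.

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- Assumed version for illustration; replace with the release you need. -->
    <version>1.17.2</version>
</dependency>
```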

  • index.jsp

<section>
<form method="POST" action="scrape">
   <input type="url" name="webURL" required/>
   <input type="submit" value="Scrape" />
</form>
</section>
 <% if (request.getSession().getAttribute("url") != null && request.getSession().getAttribute("validUrls") != null) { %>
<main>
    <h1 class="section-title">RESULT</h1>
    <div class="show-result">
       <%@include file="results.jsp" %>
    </div>
</main>
<% }
request.getSession().removeAttribute("url");
%>

This is the page where the user submits the URL to be scraped for images. The <%@include file="results.jsp" %> directive includes the content of the results.jsp file inside the div with class show-result.

  • In servlet

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
             throws ServletException, IOException {

     String url = request.getParameter("webURL");
     request.getSession().setAttribute("url", url);
     request.getSession().setAttribute("validUrls", new WebScrapper().getAllImgs(url));
     response.sendRedirect("/index.jsp");
}

This receives the URL entered by the user on the index.jsp page and stores it in the url session attribute. It also calls the getAllImgs() method, stores the returned list in the validUrls session attribute, and then redirects back to index.jsp.

  • In Java file (WebScrapper.java)

Image extraction method:

public List<String> getAllImgs(String sUrl) {

    // Keeps the valid image URLs.
    List<String> validSourceUrls = new ArrayList<>();

    try {
        // Fetch the page and parse it into a Document.
        Document doc = Jsoup.connect(sUrl).get();

        // Visit every img element that has a src attribute.
        for (Element element : doc.select("img[src]")) {
            String srcUrl = element.attr("src");
            // On JDK 11 or higher, !srcUrl.isBlank() can be used here instead;
            // it also rejects values that contain only whitespace.
            if (srcUrl.length() > 0) {
                if (validateSrcUrl(srcUrl)) {
                    if (srcUrl.contains("?")) {
                        // Removes the query string from the URL.
                        validSourceUrls.add(srcUrl.substring(0, srcUrl.indexOf('?')));
                    } else {
                        validSourceUrls.add(srcUrl);
                    }
                }
            }
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }

    return validSourceUrls;
}

This method connects to the provided URL, adds each valid image src URL to a list, and returns that list. Jsoup.connect(sUrl) creates the connection, and .get() fetches the page and parses the HTML into a Document. doc.select() then selects the elements to be scraped, and the loop visits each one as element. Each image source URL is validated, and the valid URLs are added to the list.

URL validation method:

public boolean validateSrcUrl(String url) {
     // Image extensions accepted as valid (Arrays is java.util.Arrays).
     List<String> validUrlItems = Arrays.asList("jpg", "png", "jpeg", "svg", "gif");

     // Splits the URL by . (dot) and takes the last item of the array as the extension.
     String[] parts = url.split("\\.");
     String extension = parts[parts.length - 1];

     // URLs containing ".github" are accepted even without an image extension.
     return validUrlItems.contains(extension) || url.contains(".github");
}

This method checks whether the URL is a valid one that points to an image. For this, it checks if the link ends with any of the popular image extensions. The .github check is there because GitHub images do not end with any extension.
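
The extension check can also be written so that a query string or an upper-case extension does not defeat it. The following is a minimal sketch of that idea, not the project's code: ImageUrlCheck, extensionOf, and looksLikeImage are hypothetical names, and the .github exception is left out to keep it short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original project.
final class ImageUrlCheck {

    private static final List<String> IMAGE_EXTENSIONS =
            Arrays.asList("jpg", "jpeg", "png", "svg", "gif");

    // Returns the lower-cased extension of the last path segment,
    // ignoring any query string or fragment; returns "" if there is none.
    static String extensionOf(String url) {
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            end = Math.min(end, query);
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0) {
            end = Math.min(end, fragment);
        }
        String path = url.substring(0, end);

        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) {
            return ""; // no dot in the last path segment
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True when the URL ends with one of the accepted image extensions.
    static boolean looksLikeImage(String url) {
        return IMAGE_EXTENSIONS.contains(extensionOf(url));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeImage("https://example.com/img/photo.JPG?size=large"));
        System.out.println(looksLikeImage("https://example.com/about"));
    }
}
```

Dropping the query string before looking for the extension means a URL such as photo.jpg?size=large is still recognized as an image.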

  • results.jsp


<%@page import="java.util.List" %>

<%
String url = (String) request.getSession().getAttribute("url");

List<String> sources = (List<String>) request.getSession().getAttribute("validUrls");

for (String source : sources) {
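   // Absolute URLs are used as-is; anything else is treated as relative to the scraped URL.
   source = (source.startsWith("http://") || source.startsWith("https://")) ? source : url + "/" + source;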
   String name = source.split("/")[source.split("/").length - 1];
%>
<div class="image-wrapper">
   <div class="image">
       <img src="<%= source%>" alt="alt"/>
   </div>
   <div class="details">
       <a href="<%= source%>" title="See this Image" target="_blank">Visit</a>
       <span class="imageName" title="<%= name.split("\\.")[0]%>">
             <%= name.split("\\.")[0] %>
       </span>
   </div>
</div>

<%
    }
    sources.clear();
%>

This page reads the list of valid URLs returned by the getAllImgs() method from the session and uses each one as the src of an img element. It also displays the image name.
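
Joining the page URL and a relative src with "/" works for simple cases. A more general approach is to resolve the src against the page URL with java.net.URI, as in the minimal sketch below. ImageSourceResolver, resolve, and displayName are hypothetical names, the example URLs are made up, and the page URL is assumed to be well-formed and to include a path (at least a trailing slash).

```java
import java.net.URI;

// Hypothetical helper, not part of the original project.
final class ImageSourceResolver {

    // Resolves an img src against the page URL. Absolute src values are
    // returned unchanged; relative ones are resolved against the page.
    // URI.create throws IllegalArgumentException for malformed input.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    // Returns the file name of the image without its extension.
    static String displayName(String imageUrl) {
        String path = URI.create(imageUrl).getPath();
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    public static void main(String[] args) {
        String page = "https://example.com/blog/post.html";
        String absolute = resolve(page, "images/cat.png");
        System.out.println(absolute);
        System.out.println(displayName(absolute));
    }
}
```

Jsoup can also return absolute URLs itself when asked for them, which would move this step into the scraper instead of the JSP.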

Final Words

Web scraping has a very wide range of applications. It isn't limited to extracting data from one place and displaying it in another. It is used in a variety of sectors such as investment, startups, and marketing.

After some styling of the HTML code above, this is what the final result looks like:
