Introduction to web scraping
Web scraping is one of the tools at a developer's disposal when looking to gather data from the internet. While consuming data via an API has become commonplace, most websites don't offer an API for delivering data to consumers. In order to access the data they're looking for, web scrapers and crawlers read a website's pages and feeds, analyzing the site's structure and markup language for clues. Generally speaking, information collected from scraping is fed into other programs for validation, cleaning, and input into a datastore, or it's fed into other processes such as natural language processing (NLP) toolchains or machine learning (ML) models. There are a few Python packages we could use to illustrate with, but we'll focus on Scrapy for these examples. Scrapy makes it very easy for us to quickly prototype and develop web scrapers with Python.
Scrapy concepts
Before we start looking at specific examples and use cases, let's brush up a bit on Scrapy and how it works.
Spiders: Scrapy uses Spiders to define how a site (or a bunch of sites) should be scraped for information. Scrapy lets us determine how we want the spider to crawl, what information we want to extract, and how we can extract it. Specifically, Spiders are Python classes where we'll put all of our custom logic and behavior.
import scrapy
class NewsSpider(scrapy.Spider):
    name = 'news'
    ...
Selectors: Selectors are Scrapy's mechanisms for finding data within the website's pages. They're called selectors because they provide an interface for "selecting" certain parts of the HTML page, and these selectors can be in either CSS or XPath expressions.
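As a quick illustration (the HTML snippet and the headline class name below are made up for this example), the same element can be selected with either a CSS expression or an XPath expression:

from scrapy.selector import Selector

# A small, hypothetical HTML snippet to select against
html = '<html><body><h1 class="headline">Breaking news</h1></body></html>'
sel = Selector(text=html)

# The same text node selected two ways
print(sel.css('h1.headline::text').get())                  # 'Breaking news'
print(sel.xpath('//h1[@class="headline"]/text()').get())   # 'Breaking news'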
Items: Items are the data that is extracted from selectors in a common data model. Since our goal is a structured result from unstructured inputs, Scrapy provides an Item class which we can use to define how our scraped data should be structured and what fields it should have.
import scrapy
class Article(scrapy.Item):
    headline = scrapy.Field()
    ...
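To see how an Item flows through a spider, here is a minimal, hedged sketch in which a spider yields one Article per headline it finds; the URL and the h2.headline selector are placeholders rather than a real site's markup:

import scrapy

class Article(scrapy.Item):
    headline = scrapy.Field()

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://example.com/news']  # placeholder URL

    def parse(self, response):
        # Yield one structured Article item per headline found on the page
        for headline in response.css('h2.headline::text').getall():
            yield Article(headline=headline)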
Reddit-less front page
Suppose we love the images posted to Reddit, but don't want any of the comments or self posts. We can use Scrapy to make a Reddit Spider that will fetch all the photos from the front page and put them on our own HTML page, which we can then browse instead of Reddit.
To start, we'll create a RedditSpider which we can use to traverse the front page and handle custom behavior.

import scrapy
class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = [
        'https://www.reddit.com'
    ]
Above, we've defined a RedditSpider, inheriting Scrapy's Spider. We've named it reddit and have populated the class' start_urls attribute with a URL to Reddit from which we'll extract the images.
At this point, we'll need to begin defining our parsing logic. We need to figure out an expression that the RedditSpider can use to determine whether it's found an image. If we look at Reddit's robots.txt file, we can see that our spider can't crawl any comment pages without being in violation of the robots.txt file, so we'll need to grab our image URLs without following through to the comment pages.
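As a side note, Scrapy can enforce robots.txt rules for us: recent versions of the project template generated by scrapy startproject enable the ROBOTSTXT_OBEY setting by default, and it can also be set per spider. A minimal sketch (the spider name here is just a placeholder):

import scrapy

class PoliteRedditSpider(scrapy.Spider):
    name = 'polite_reddit'
    start_urls = ['https://www.reddit.com']
    # Ask Scrapy's robots.txt middleware to skip disallowed URLs (such as the comment pages)
    custom_settings = {'ROBOTSTXT_OBEY': True}

    def parse(self, response):
        ...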
By looking at Reddit, we can see that external links are included on the homepage directly next to the post's title. We'll update RedditSpider to include a parser to grab this URL. Reddit includes the external URL as a link on the page, so we should be able to just loop through the links on the page and find URLs that are for images.

class RedditSpider(scrapy.Spider):
    ...
    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            ...
In a parse method on our RedditSpider class, I've started to define how we'll be parsing our response for results. To start, we grab all of the href attributes from the page's links using a basic XPath selector. Now that we're enumerating the page's links, we can start to analyze the links for images.

def parse(self, response):
    links = response.xpath('//a/@href')
    for link in links:
        # Extract the URL text from the element
        url = link.get()
        # Check if the URL contains an image extension
        if any(extension in url for extension in ['.jpg', '.gif', '.png']):
            ...
To actually access the text information from the link's href attribute, we use Scrapy's .get() function, which will return the link destination as a string. Next, we check to see if the URL contains an image file extension. We use Python's any() built-in function for this. This isn't all-encompassing for all image file extensions, but it's a start. From here we can push our images into a local HTML file for viewing.

def parse(self, response):
    links = response.xpath('//img/@src')
    html = ''
    for link in links:
        # Extract the URL text from the element
        url = link.get()
        # Check if the URL contains an image extension
        if any(extension in url for extension in ['.jpg', '.gif', '.png']):
            html += '''
            <a href="{url}" target="_blank">
                <img src="{url}" height="33%" width="33%" />
            </a>
            '''.format(url=url)
    # Open an HTML file, save the results
    with open('frontpage.html', 'a') as page:
        page.write(html)
    # Close the file
    page.close()
To start, we begin collecting the HTML file contents as a string which will be written to a file called frontpage.html at the end of the process. You'll notice that instead of pulling the image location from '//a/@href', we've updated our links selector to use the image's src attribute: '//img/@src'. This will give us more consistent results, and select only images.
As our RedditSpider's parser finds images, it builds a link with a preview image and dumps the string to our html variable. Once we've collected all of the images and generated the HTML, we open the local HTML file (or create it) and overwrite it with our new HTML content before closing the file again with page.close(). If we run scrapy runspider reddit.py, we can see that this file is built properly and contains images from Reddit's front page.
But it looks like it contains all of the images from Reddit's front page – not just user-posted content. Let's update our parse command a bit to blacklist certain domains from our results.
If we look at frontpage.html, we can see that most of Reddit's assets come from redditstatic.com and redditmedia.com. We'll just filter those results out and retain everything else. With these updates, our RedditSpider class now looks like the below:

import scrapy
class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = [
        'https://www.reddit.com'
    ]

    def parse(self, response):
        links = response.xpath('//img/@src')
        html = ''
        for link in links:
            # Extract the URL text from the element
            url = link.get()
            # Check if the URL contains an image extension
            if any(extension in url for extension in ['.jpg', '.gif', '.png']) \
                    and not any(domain in url for domain in ['redditstatic.com', 'redditmedia.com']):
                html += '''
                <a href="{url}" target="_blank">
                    <img src="{url}" height="33%" width="33%" />
                </a>
                '''.format(url=url)
        # Open an HTML file, save the results
        with open('frontpage.html', 'w') as page:
            page.write(html)
        # Close the file
        page.close()
We're simply adding our domain blacklist to an exclusionary any() expression. These statements could be tweaked to read from a separate configuration file, local database, or cache – if need be.
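For instance, a small sketch of the configuration-file approach might look like this; the excluded_domains.json file name and its contents are assumptions made for illustration:

import json

# Hypothetical config file containing: ["redditstatic.com", "redditmedia.com"]
with open('excluded_domains.json') as config:
    excluded_domains = json.load(config)

def is_wanted_image(url):
    # Keep image URLs that don't belong to any excluded domain
    has_image_extension = any(ext in url for ext in ['.jpg', '.gif', '.png'])
    is_excluded = any(domain in url for domain in excluded_domains)
    return has_image_extension and not is_excluded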
Extracting Amazon price data
If you're running an ecommerce website, intelligence is key. With Scrapy we can easily automate the process of collecting information about our competitors, our market, or our listings.
For this task, we'll extract pricing data from search listings on Amazon and use the results to provide some basic insights. If we visit Amazon's search results page and inspect it, we notice that Amazon stores the price in a series of divs, most notably using a class called .a-offscreen. We can formulate a CSS selector that extracts the price off the page:

prices = response.css('.a-price .a-offscreen::text').getall()
With this CSS selector in mind, let's build our AmazonSpider.

import scrapy
from re import sub
from decimal import Decimal

def convert_money(money):
    return Decimal(sub(r'[^\d.]', '', money))

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = [
        'https://www.amazon.com/s?k=paint'
    ]

    def parse(self, response):
        # Find the Amazon price element
        prices = response.css('.a-price .a-offscreen::text').getall()
        # Initialize some counters and stats objects
        stats = dict()
        values = []
        for price in prices:
            value = convert_money(price)
            values.append(value)
        # Sort our values before calculating
        values.sort()
        # Calculate price statistics
        stats['average_price'] = round(sum(values) / len(values), 2)
        stats['lowest_price'] = values[0]
        stats['highest_price'] = values[-1]
        stats['total_prices'] = len(values)
        print(stats)
A few things to note about our AmazonSpider class:
convert_money(): This helper simply converts strings formatted like '$45.67' and casts them to a Python Decimal type, which can be used for computations and avoids issues with locale by not including a '$' anywhere in the regular expression.
getall(): The .getall() function is a Scrapy function that works similarly to the .get() function we used before, but this returns all the extracted values as a list which we can work with. Running the command scrapy runspider amazon.py in the project folder will dump output resembling the following:

{'average_price': Decimal('38.23'), 'lowest_price': Decimal('3.63'), 'highest_price': Decimal('689.95'), 'total_prices': 58}
It's easy to imagine building a dashboard that allows you to store scraped values in a datastore and visualize the data as you see fit.
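As a rough sketch of that idea (the SQLite file name and table schema below are assumptions, not part of the spider above), the stats dictionary could be appended to a local database that a dashboard later reads from:

import sqlite3

def save_stats(stats, db_path='prices.db'):
    # Store one row of price statistics per scrape run (hypothetical schema)
    conn = sqlite3.connect(db_path)
    conn.execute('''CREATE TABLE IF NOT EXISTS price_stats
                    (average_price TEXT, lowest_price TEXT, highest_price TEXT, total_prices INTEGER)''')
    conn.execute('INSERT INTO price_stats VALUES (?, ?, ?, ?)',
                 (str(stats['average_price']), str(stats['lowest_price']),
                  str(stats['highest_price']), stats['total_prices']))
    conn.commit()
    conn.close()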
Considerations at scale
As you build more web crawlers and you continue to follow more advanced scraping workflows, you'll likely notice a few things:
- Sites change, now more than ever.
- Getting consistent results across thousands of pages is tricky.
- Performance considerations can be crucial.
Sites change, now more than ever
On occasion, AliExpress, for example, will return a login page rather than search listings. Sometimes Amazon will decide to raise a Captcha, or Twitter will return an error. While these errors can sometimes simply be flickers, others will require a complete re-architecture of your web scrapers. Nowadays, modern front-end frameworks are oftentimes pre-compiled for the browser, which can mangle class names and ID strings, and sometimes a designer or developer will change an HTML class name during a redesign. It's important that our Scrapy crawlers are resilient, but keep in mind that changes will occur over time.
Getting consistent results across thousands of pages is tricky
Slight variations of user-inputted text can really add up. Think of all of the different spellings and capitalizations you may encounter in just usernames. Pre-processing, normalizing, and standardizing text before performing an action on it or storing the value is best practice ahead of most NLP or ML software processes, and gives the best results.
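A minimal sketch of that kind of pre-processing might look like the following; the exact normalization rules will depend on your data:

import unicodedata

def normalize_text(value):
    # Normalize unicode, collapse whitespace, and lowercase before storing or comparing
    value = unicodedata.normalize('NFKC', value)
    value = ' '.join(value.split())
    return value.strip().lower()

# Two differently-formatted inputs compare equal after normalization
print(normalize_text('  JohnDoe\u00a0 ') == normalize_text('johndoe'))  # True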
Performance considerations can be crucial
You'll want to make sure you're operating at least moderately efficiently before attempting to process 10,000 websites from your laptop one night. As your dataset grows, it becomes more and more costly to manipulate it in terms of memory or processing power. In a similar regard, you may want to extract the text from one news article at a time, rather than downloading all 10,000 articles at once. As we've seen in this tutorial, performing advanced scraping operations is actually quite easy using Scrapy's framework. Some advanced next steps might include loading selectors from a database and scraping using very generic Spider classes, or using proxies or modified user-agents to see if the HTML changes based on location or device type. Scraping in the real world becomes complicated because of all the edge cases; Scrapy provides an easy way to build this logic in Python.
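As one hedged example of that last idea, a spider can override its user-agent through Scrapy's custom_settings attribute to check whether a site serves different HTML to mobile devices; the spider name and user-agent string below are just samples:

import scrapy

class MobileCheckSpider(scrapy.Spider):
    name = 'mobile_check'
    start_urls = ['https://www.reddit.com']
    # Per-spider override of the USER_AGENT setting (sample mobile UA string)
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15'
    }

    def parse(self, response):
        # Log the page title to see whether a mobile-specific page was served
        self.logger.info(response.css('title::text').get())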
This post is a part of Kite’s new series on Python. You can check out the code from this and other posts on our GitHub repository.
This blog post originally appeared on Kite.
What is continuous integration?
Continuous integration (CI) is a process in which, whenever a developer checks code in to a shared repository, the whole project is automatically built and tested in order to find integration issues.
Why is continuous integration required if we can also do this task manually?
In the typical software development workflow, as soon as a developer finishes some code, he pushes it to the shared repository, and when he integrates two modules he wants to know whether the build is stable or not. He performs unit testing, whereas integration testing is done by another team (the testing team), and only then is the code finally deployed. If we do this manually, the developer has to wait for the testing team to test the whole build and give a go-ahead if the build is successful; if the build fails, the developer has to fix the issue and then wait again for the testing team's go-ahead. In order to automate this whole cycle, continuous integration comes into play.
Continuous integration reveals integration issues at a very early stage.
How does CI work?
As a developer, I check my code in to my local repository (say, Git). Once all the check-ins are done, I push the entire code to the remote repository (say, GitHub), where all the developers push their code at the end of the day. A CI server tracks the remote repository: whenever anything new happens there, whether it is the addition or modification of code or a new build arriving in the repository, the CI server performs the compilation, runs the unit tests, executes the regression tests, and publishes the reports. A build management tool such as Maven or Ant is also configured with the CI tool to perform these tasks.
Benefits of CI
- Continuously integrates the whole project, thereby eliminating a separate integration phase.
- Catches issues at a very early stage.
- Reduces time spent debugging, as issues are caught at each integration.
- Helps deliver the product more rapidly.
Famous CI tools
- Jenkins
- Hudson
- TeamCity
- Team Foundation Server
- GitLab
Of the above, Jenkins and Hudson are free CI tools.
What is Jenkins?
Jenkins is an open-source continuous integration tool/server, which is used to automate all sorts of tasks related to building, testing, delivering, or deploying the whole project whenever needed.
Jenkins was originally created as Hudson, but after a dispute with Oracle, the project was continued by the open-source community under the name "Jenkins".
Installing and Running Jenkins
Use the following steps to install and run Jenkins in a local environment:
- Download the jenkins.war file from the official Jenkins website.
- Open CMD, navigate to the directory containing jenkins.war, and run the following command to start the server: java -jar jenkins.war --httpPort=8080
Note: If port 8080 is already in use, give another port number such as 9090 or 8090.
Configuring Jenkins Plugin
More than 600 plug-ins are available to customize Jenkins as per project requirements.
To manage Jenkins plug-ins, use the following steps:
- Open Jenkins in a browser.
- Click on "Manage Jenkins".
- Click on "Manage Plug-ins".
- Go to "Available Plug-ins" and choose the plug-in you need.
Note: The important plugins that should be installed are the CVS plugin, the Git plugin, and the Git Client plugin.
Jenkins configuration wizard
In order to connect Jenkins with other tools (e.g. Java, Maven), we first need to configure Jenkins. Use the following steps to configure Jenkins:
- Open Jenkins in any browser.
- Click on "Manage Jenkins".
- Click on "Global tool configuration".
- Now configure the following:
- Java:
a) Set the path of the JAVA_HOME variable as in the system variables (e.g. C:\Program Files\Java\jdk1.8.0_51).
- Git:
a) Set the path of the Git installation directory (e.g. C:\Program Files\Git\bin\git.exe).
- Maven:
a) Set the path of the Maven installation directory as the MAVEN_HOME variable (e.g. E:\maven2\apache-maven-3.0.5-bin\apache-maven-3.0.5).
- Now save the above settings.
MAVEN
Apache Maven is a software project management and build management tool for Java frameworks.
It helps in maintaining and managing projects (development code, test cases, and frameworks).
Its development process is very similar to Ant; however, it is much more advanced than Ant.
Why MAVEN?
1) Central repository to get dependencies.
Maven has its own repository site (the Maven Repository) where you can find the JAR files and libraries of most software released so far. So whenever you build your project with a Maven configuration, you need not provide any JARs for your project yourself: the Maven project you have built will automatically connect to the Maven repository site, look for the JARs you want, download them, and place them in your project's build path.
2) Maintaining common structure across the organization.
In a company there are multiple teams working on different frameworks, and each team defines its framework with different formats and a different structure. This leads to inconsistency in how frameworks are defined across the company. To maintain consistency we need a common framework structure, and this is where Maven comes into play. Maven suggests a standard template: one template for test cases and another for Java development. So we can simply convert our project to Maven, take that template, and introduce all our test cases or inject our code according to that template.
3) Flexibility in integration with Continuous integration tools.
Suppose you need to execute 10,000 test cases in a single night; you are not going to go to each and every test case and execute it manually. For this you need a continuous integration tool such as Jenkins to do the task. To make the project compatible with Jenkins, you need a build management tool for your framework, and this is where Maven comes into play.
4) Plugins for test framework execution.
Maven provides excellent plugins for testing.
How to Install Maven
Prerequisite: Install Java on your system.
1) Go to the official website (Maven Download).
2) Click on bin.zip for the Windows download or bin.tar.gz for Mac/Linux.
3) Unzip the folder.
4) Set the Maven home in your system variables, just like you set your Java home.
5) Open cmd, type "mvn --version", and press Enter.
If the Maven version is printed, it means you have successfully configured Maven on your machine.
Note: If you see something like "mvn is not recognized" or some other error, it means you have not properly set Maven in your system variables.
Understanding Maven terminologies
1) groupId: It identifies your project uniquely across all projects.
2) Artifact: A file, usually a JAR, that gets deployed to a Maven repository.
3) archetype:generate: Generates a new project from an archetype.
Suppose I need the Selenium JAR in my code.
<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.6.0</version>
</dependency>
Here, the groupId indicates that I need the JAR files of Selenium, the artifactId tells us that I need the Selenium JAR files for Java code, and the version tells us the version of the JAR file.
Creating a Maven Project using cmd
The command for creating the Maven project is:
mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
Here,
artifactId stands for the project name.
maven-archetype-quickstart is the archetype (template) that is mostly used for test framework development.
Hence, this command will create a dummy Maven project with a neat skeleton hierarchy for you.
Note: If you want to use this Maven project with Eclipse, you cannot use it directly, as the project does not contain a .classpath file. To create a .classpath file outside Eclipse, you have to perform the following steps:
1) Open cmd and go to the directory where your Maven project's pom.xml file is present.
2) Now run "mvn eclipse:eclipse" and press Enter.
Now you can import this project into Eclipse.
By default, all current versions of Eclipse come with Maven plugins.
Where are the JARs installed on the machine?
They are installed in the .m2 repository on the system. As soon as we install Maven on the system, it creates a local repository named .m2 and stores all the JARs there.