Introduction to web scraping
Web scraping is one of the tools at a developer's disposal when looking to gather data from the internet. While consuming data via an API has become commonplace, most websites don't offer an API for delivering data to consumers. In order to access the data they're looking for, web scrapers and crawlers read a website's pages and feeds, analyzing the site's structure and markup language for clues. Generally speaking, information collected from scraping is fed into other programs for validation, cleaning, and input into a datastore, or it's fed into other processes such as natural language processing (NLP) toolchains or machine learning (ML) models. There are a few Python packages we could use to illustrate with, but we'll focus on Scrapy for these examples. Scrapy makes it very easy for us to quickly prototype and develop web scrapers with Python.
Scrapy concepts
Before we start looking at specific examples and use cases, let's brush up a bit on Scrapy and how it works.
Spiders: Scrapy uses Spiders to define how a site (or a bunch of sites) should be scraped for information. Scrapy lets us determine how we want the spider to crawl, what information we want to extract, and how we can extract it. Specifically, Spiders are Python classes where we'll put all of our custom logic and behavior.
import scrapy
class NewsSpider(scrapy.Spider):
    name = 'news'
    ...
Selectors: Selectors are Scrapy's mechanisms for finding data within the website's pages. They're called selectors because they provide an interface for "selecting" certain parts of the HTML page, and these selectors can be in either CSS or XPath expressions.
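As a quick illustration (the HTML snippet and the headline class name below are made up for this example), the same element can be selected with either a CSS expression or an XPath expression:

from scrapy.selector import Selector

# A small, hypothetical HTML snippet to select against
html = '<html><body><h1 class="headline">Breaking news</h1></body></html>'
sel = Selector(text=html)

# The same text node selected two ways
print(sel.css('h1.headline::text').get())                  # 'Breaking news'
print(sel.xpath('//h1[@class="headline"]/text()').get())   # 'Breaking news'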
Items: Items are the data that is extracted from selectors in a common data model. Since our goal is a structured result from unstructured inputs, Scrapy provides an Item class which we can use to define how our scraped data should be structured and what fields it should have.
import scrapy
class Article(scrapy.Item):
    headline = scrapy.Field()
    ...
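To see how an Item flows through a spider, here is a minimal, hedged sketch in which a spider yields one Article per headline it finds; the URL and the h2.headline selector are placeholders rather than a real site's markup:

import scrapy

class Article(scrapy.Item):
    headline = scrapy.Field()

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://example.com/news']  # placeholder URL

    def parse(self, response):
        # Yield one structured Article item per headline found on the page
        for headline in response.css('h2.headline::text').getall():
            yield Article(headline=headline)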
Reddit-less front page
Suppose we love the images posted to Reddit, but don't want any of the comments or self posts. We can use Scrapy to make a Reddit Spider that will fetch all the photos from the front page and put them on our own HTML page, which we can then browse instead of Reddit.
To start, we'll create a RedditSpider which we can use to traverse the front page and handle custom behavior.

import scrapy
class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = [
        'https://www.reddit.com'
    ]
Above, we've defined a RedditSpider, inheriting Scrapy's Spider. We've named it reddit and have populated the class' start_urls attribute with a URL to Reddit from which we'll extract the images.
At this point, we'll need to begin defining our parsing logic. We need to figure out an expression that the RedditSpider can use to determine whether it's found an image. If we look at Reddit's robots.txt file, we can see that our spider can't crawl any comment pages without being in violation of the robots.txt file, so we'll need to grab our image URLs without following through to the comment pages.
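As a side note, Scrapy can enforce robots.txt rules for us: recent versions of the project template generated by scrapy startproject enable the ROBOTSTXT_OBEY setting by default, and it can also be set per spider. A minimal sketch (the spider name here is just a placeholder):

import scrapy

class PoliteRedditSpider(scrapy.Spider):
    name = 'polite_reddit'
    start_urls = ['https://www.reddit.com']
    # Ask Scrapy's robots.txt middleware to skip disallowed URLs (such as the comment pages)
    custom_settings = {'ROBOTSTXT_OBEY': True}

    def parse(self, response):
        ...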
By looking at Reddit, we can see that external links are included on the homepage directly next to the post's title. We'll update RedditSpider to include a parser to grab this URL. Reddit includes the external URL as a link on the page, so we should be able to just loop through the links on the page and find URLs that are for images.

class RedditSpider(scrapy.Spider):
    ...
    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            ...
In a parse method on our RedditSpider class, I've started to define how we'll be parsing our response for results. To start, we grab all of the href attributes from the page's links using a basic XPath selector. Now that we're enumerating the page's links, we can start to analyze the links for images.

def parse(self, response):
    links = response.xpath('//a/@href')
    for link in links:
        # Extract the URL text from the element
        url = link.get()
        # Check if the URL contains an image extension
        if any(extension in url for extension in ['.jpg', '.gif', '.png']):
            ...
To actually access the text information from the link's href attribute, we use Scrapy's .get() function, which will return the link destination as a string. Next, we check to see if the URL contains an image file extension. We use Python's any() built-in function for this. This isn't all-encompassing for all image file extensions, but it's a start. From here we can push our images into a local HTML file for viewing.

def parse(self, response):
    links = response.xpath('//img/@src')
    html = ''
    for link in links:
        # Extract the URL text from the element
        url = link.get()
        # Check if the URL contains an image extension
        if any(extension in url for extension in ['.jpg', '.gif', '.png']):
            html += '''
            <a href="{url}" target="_blank">
                <img src="{url}" height="33%" width="33%" />
            </a>
            '''.format(url=url)
    # Open an HTML file, save the results
    with open('frontpage.html', 'a') as page:
        page.write(html)
    # Close the file
    page.close()
To start, we begin collecting the HTML file contents as a string which will be written to a file called frontpage.html at the end of the process. You'll notice that instead of pulling the image location from '//a/@href', we've updated our links selector to use the image's src attribute: '//img/@src'. This will give us more consistent results, and select only images.
As our RedditSpider's parser finds images, it builds a link with a preview image and dumps the string to our html variable. Once we've collected all of the images and generated the HTML, we open the local HTML file (or create it) and overwrite it with our new HTML content before closing the file again with page.close(). If we run scrapy runspider reddit.py, we can see that this file is built properly and contains images from Reddit's front page.
But it looks like it contains all of the images from Reddit's front page – not just user-posted content. Let's update our parse command a bit to blacklist certain domains from our results.
If we look at frontpage.html, we can see that most of Reddit's assets come from redditstatic.com and redditmedia.com. We'll just filter those results out and retain everything else. With these updates, our RedditSpider class now looks like the below:

import scrapy
class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = [
        'https://www.reddit.com'
    ]

    def parse(self, response):
        links = response.xpath('//img/@src')
        html = ''
        for link in links:
            # Extract the URL text from the element
            url = link.get()
            # Check if the URL contains an image extension
            if any(extension in url for extension in ['.jpg', '.gif', '.png']) \
                    and not any(domain in url for domain in ['redditstatic.com', 'redditmedia.com']):
                html += '''
                <a href="{url}" target="_blank">
                    <img src="{url}" height="33%" width="33%" />
                </a>
                '''.format(url=url)
        # Open an HTML file, save the results
        with open('frontpage.html', 'w') as page:
            page.write(html)
        # Close the file
        page.close()
We're simply adding our domain blacklist to an exclusionary any() expression. These statements could be tweaked to read from a separate configuration file, local database, or cache – if need be.
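For instance, a small sketch of the configuration-file approach might look like this; the excluded_domains.json file name and its contents are assumptions made for illustration:

import json

# Hypothetical config file containing: ["redditstatic.com", "redditmedia.com"]
with open('excluded_domains.json') as config:
    excluded_domains = json.load(config)

def is_wanted_image(url):
    # Keep image URLs that don't belong to any excluded domain
    has_image_extension = any(ext in url for ext in ['.jpg', '.gif', '.png'])
    is_excluded = any(domain in url for domain in excluded_domains)
    return has_image_extension and not is_excluded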
Extracting Amazon price data
If you're running an ecommerce website, intelligence is key. With Scrapy we can easily automate the process of collecting information about our competitors, our market, or our listings.
For this task, we'll extract pricing data from search listings on Amazon and use the results to provide some basic insights. If we visit Amazon's search results page and inspect it, we notice that Amazon stores the price in a series of divs, most notably using a class called .a-offscreen. We can formulate a CSS selector that extracts the price off the page:

prices = response.css('.a-price .a-offscreen::text').getall()
With this CSS selector in mind, let's build our AmazonSpider.

import scrapy
from re import sub
from decimal import Decimal

def convert_money(money):
    return Decimal(sub(r'[^\d.]', '', money))

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = [
        'https://www.amazon.com/s?k=paint'
    ]

    def parse(self, response):
        # Find the Amazon price element
        prices = response.css('.a-price .a-offscreen::text').getall()
        # Initialize some counters and stats objects
        stats = dict()
        values = []
        for price in prices:
            value = convert_money(price)
            values.append(value)
        # Sort our values before calculating
        values.sort()
        # Calculate price statistics
        stats['average_price'] = round(sum(values) / len(values), 2)
        stats['lowest_price'] = values[0]
        stats['highest_price'] = values[-1]
        stats['total_prices'] = len(values)
        print(stats)
A few things to note about our AmazonSpider class:
convert_money(): This helper simply converts strings formatted like '$45.67' and casts them to a Python Decimal type, which can be used for computations and avoids issues with locale by not including a '$' anywhere in the regular expression.
getall(): The .getall() function is a Scrapy function that works similarly to the .get() function we used before, but this returns all the extracted values as a list which we can work with. Running the command scrapy runspider amazon.py in the project folder will dump output resembling the following:

{'average_price': Decimal('38.23'), 'lowest_price': Decimal('3.63'), 'highest_price': Decimal('689.95'), 'total_prices': 58}
It's easy to imagine building a dashboard that allows you to store scraped values in a datastore and visualize the data as you see fit.
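As a rough sketch of that idea (the SQLite file name and table schema below are assumptions, not part of the spider above), the stats dictionary could be appended to a local database that a dashboard later reads from:

import sqlite3

def save_stats(stats, db_path='prices.db'):
    # Store one row of price statistics per scrape run (hypothetical schema)
    conn = sqlite3.connect(db_path)
    conn.execute('''CREATE TABLE IF NOT EXISTS price_stats
                    (average_price TEXT, lowest_price TEXT, highest_price TEXT, total_prices INTEGER)''')
    conn.execute('INSERT INTO price_stats VALUES (?, ?, ?, ?)',
                 (str(stats['average_price']), str(stats['lowest_price']),
                  str(stats['highest_price']), stats['total_prices']))
    conn.commit()
    conn.close()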
Considerations at scale
As you build more web crawlers and you continue to follow more advanced scraping workflows, you'll likely notice a few things:
- Sites change, now more than ever.
- Getting consistent results across thousands of pages is tricky.
- Performance considerations can be crucial.
Sites change, now more than ever
On occasion, AliExpress, for example, will return a login page rather than search listings. Sometimes Amazon will decide to raise a Captcha, or Twitter will return an error. While these errors can sometimes simply be flickers, others will require a complete re-architecture of your web scrapers. Nowadays, modern front-end frameworks are oftentimes pre-compiled for the browser, which can mangle class names and ID strings, and sometimes a designer or developer will change an HTML class name during a redesign. It's important that our Scrapy crawlers are resilient, but keep in mind that changes will occur over time.
Getting consistent results across thousands of pages is tricky
Slight variations of user-inputted text can really add up. Think of all of the different spellings and capitalizations you may encounter in just usernames. Pre-processing, normalizing, and standardizing text before performing an action on it or storing the value is best practice ahead of most NLP or ML software processes, and gives the best results.
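A minimal sketch of that kind of pre-processing might look like the following; the exact normalization rules will depend on your data:

import unicodedata

def normalize_text(value):
    # Normalize unicode, collapse whitespace, and lowercase before storing or comparing
    value = unicodedata.normalize('NFKC', value)
    value = ' '.join(value.split())
    return value.strip().lower()

# Two differently-formatted inputs compare equal after normalization
print(normalize_text('  JohnDoe\u00a0 ') == normalize_text('johndoe'))  # True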
Performance considerations can be crucial
You'll want to make sure you're operating at least moderately efficiently before attempting to process 10,000 websites from your laptop one night. As your dataset grows, it becomes more and more costly to manipulate it in terms of memory or processing power. In a similar regard, you may want to extract the text from one news article at a time, rather than downloading all 10,000 articles at once. As we've seen in this tutorial, performing advanced scraping operations is actually quite easy using Scrapy's framework. Some advanced next steps might include loading selectors from a database and scraping using very generic Spider classes, or using proxies or modified user-agents to see if the HTML changes based on location or device type. Scraping in the real world becomes complicated because of all the edge cases; Scrapy provides an easy way to build this logic in Python.
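As one hedged example of that last idea, a spider can override its user-agent through Scrapy's custom_settings attribute to check whether a site serves different HTML to mobile devices; the spider name and user-agent string below are just samples:

import scrapy

class MobileCheckSpider(scrapy.Spider):
    name = 'mobile_check'
    start_urls = ['https://www.reddit.com']
    # Per-spider override of the USER_AGENT setting (sample mobile UA string)
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15'
    }

    def parse(self, response):
        # Log the page title to see whether a mobile-specific page was served
        self.logger.info(response.css('title::text').get())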
This post is a part of Kite’s new series on Python. You can check out the code from this and other posts on our GitHub repository.
This blog post originally appeared on Kite.
What is continuous integration?
Continuous integration (CI) is a process in which, whenever a developer checks code in to a shared repository, the whole project is automatically built and tested in order to find integration issues.
Why is continuous integration required if we can also do this task manually?
In the typical software development workflow, as soon as a developer finishes some code, he pushes it to the shared repository, and when he integrates two modules he wants to know whether the build is stable or not. He performs unit testing, whereas integration testing is done by another team (the testing team), and only then is the code finally deployed. If we do this manually, the developer has to wait for the testing team to test the whole build and give a go-ahead if the build is successful; if the build fails, the developer has to fix the issue and then wait again for the testing team's go-ahead. In order to automate this whole cycle, continuous integration comes into play.
Continuous integration reveals integration issues at a very early stage.
How does CI work?
As a developer, I check my code in to my local repository (say, Git). Once all the check-ins are done, I push the entire code to the remote repository (say, GitHub), where all the developers push their code at the end of the day. A CI server tracks the remote repository: whenever anything new happens there, whether it is the addition or modification of code or a new build arriving in the repository, the CI server performs the compilation, runs the unit tests, executes the regression tests, and publishes the reports. A build management tool such as Maven or Ant is also configured with the CI tool to perform these tasks.
Benefits of CI
- Continuously integrates the whole project, thereby eliminating a separate integration phase.
- Catches issues at a very early stage.
- Reduces time spent debugging, as issues are caught at each integration.
- Helps deliver the product more rapidly.
Famous CI tools
- Jenkins
- Hudson
- TeamCity
- Team Foundation Server
- GitLab
Of the above, Jenkins and Hudson are free CI tools.
What is Jenkins?
Jenkins is an open-source continuous integration tool/server, which is used to automate all sorts of tasks related to building, testing, delivering, or deploying the whole project whenever needed.
Jenkins was originally created as Hudson, but after a dispute with Oracle, the project was continued by the open-source community under the name "Jenkins".
Installing and Running Jenkins
Use the following steps to install and run Jenkins in a local environment:
- Download the jenkins.war file from the official Jenkins website.
- Open CMD, navigate to the directory containing jenkins.war, and run the following command to start the server: java -jar jenkins.war --httpPort=8080
Note: If port 8080 is already in use, give another port number such as 9090 or 8090.
Configuring Jenkins Plugin
More than 600 plug-ins are available to customize Jenkins as per project requirements.
To manage Jenkins plug-ins, use the following steps:
- Open Jenkins in a browser.
- Click on "Manage Jenkins".
- Click on "Manage Plug-ins".
- Go to "Available Plug-ins" and choose the plug-in you need.
Note: The important plugins that should be installed are the CVS plugin, the Git plugin, and the Git Client plugin.
Jenkins configuration wizard
In order to connect Jenkins with other tools (e.g. Java, Maven), we first need to configure Jenkins. Use the following steps to configure Jenkins:
- Open Jenkins in any browser.
- Click on "Manage Jenkins".
- Click on "Global tool configuration".
- Now configure the following:
- Java:
a) Set the path of the JAVA_HOME variable as in the system variables (e.g. C:\Program Files\Java\jdk1.8.0_51).
- Git:
a) Set the path of the Git installation directory (e.g. C:\Program Files\Git\bin\git.exe).
- Maven:
a) Set the path of the Maven installation directory as the MAVEN_HOME variable (e.g. E:\maven2\apache-maven-3.0.5-bin\apache-maven-3.0.5).
- Now save the above settings.
MAVEN
Apache Maven is a software project management and build management tool for Java frameworks.
It helps in maintaining and managing projects (development code, test cases, and frameworks).
Its development process is very similar to Ant; however, it is much more advanced than Ant.
Why MAVEN?
1) Central repository to get dependencies.
Maven has its own repository site (the Maven Repository) where you can find the JAR files and libraries of most software released so far. So whenever you build your project with a Maven configuration, you need not provide any JARs for your project yourself: the Maven project you have built will automatically connect to the Maven repository site, look for the JARs you want, download them, and place them in your project's build path.
2) Maintaining common structure across the organization.
In a company there are multiple teams working on different frameworks, and each team defines its framework with different formats and a different structure. This leads to inconsistency in how frameworks are defined across the company. To maintain consistency we need a common framework structure, and this is where Maven comes into play. Maven suggests a standard template: one template for test cases and another for Java development. So we can simply convert our project to Maven, take that template, and introduce all our test cases or inject our code according to that template.
3) Flexibility in integration with Continuous integration tools.
Suppose you need to execute 10,000 test cases in a single night; you are not going to go to each and every test case and execute it manually. For this you need a continuous integration tool such as Jenkins to do the task. To make the project compatible with Jenkins, you need a build management tool for your framework, and this is where Maven comes into play.
4) Plugins for test framework execution.
Maven provides excellent plugins for testing.
How to Install Maven
Prerequisite: Install Java on your system.
1) Go to the official website (Maven Download).
2) Click on bin.zip for the Windows download or bin.tar.gz for Mac/Linux.
3) Unzip the folder.
4) Set the Maven home in your system variables, just like you set your Java home.
5) Open cmd, type "mvn --version", and press Enter.
If the Maven version is printed, it means you have successfully configured Maven on your machine.
Note: If you see something like "mvn is not recognized" or some other error, it means you have not properly set Maven in your system variables.
Understanding Maven terminologies
1) groupId: It identifies your project uniquely across all projects.
2) Artifact: A file, usually a JAR, that gets deployed to a Maven repository.
3) archetype:generate: Generates a new project from an archetype.
Suppose I need the Selenium JAR in my code.
<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.6.0</version>
</dependency>
Here, the groupId indicates that I need the JAR files of Selenium, the artifactId tells us that I need the Selenium JAR files for Java code, and the version tells us the version of the JAR file.
Creating a Maven Project using cmd
The command for creating the Maven project is:
mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
Here,
artifactId stands for the project name.
maven-archetype-quickstart is the archetype (template) that is mostly used for test framework development.
Hence, this command will create a dummy Maven project with a neat skeleton hierarchy for you.
Note: If you want to use this Maven project with Eclipse, you cannot use it directly, as the project does not contain a .classpath file. To create a .classpath file outside Eclipse, you have to perform the following steps:
1) Open cmd and go to the directory where your Maven project's pom.xml file is present.
2) Now run "mvn eclipse:eclipse" and press Enter.
Now you can import this project into Eclipse.
By default, all current versions of Eclipse come with Maven plugins.
Where are the JARs installed on the machine?
They are installed in the .m2 repository on the system. As soon as we install Maven on the system, it creates a local repository named .m2 and stores all the JARs there.