Web Scraping with Python Made Easy
Learn how to scrape websites with Python
Beautiful Soup is a Python library that facilitates scraping information from a website. In this post, I would like to show you some of the basics so that you can start scraping websites on your own. We will build a Python web scraper step by step. It is easier than it sounds.
Why Python Web scraping?
Web scraping consists of extracting information from a website through a program or script. Scraping helps automate data extraction and it is much faster than if we had to extract information manually. It can really save hours of manual and tedious work.
For example, if we wanted a list of the titles of all products uploaded in the eBay "Wireless Headphones" category, we could write a Python script and automate the task using Beautiful Soup.
How to install Beautiful Soup?
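Beautiful Soup can be installed by running the pip command in the terminal. The script in this post also uses the requests library and the lxml parser, which are separate packages, so we install them at the same time. Check the official documentation for additional details.
pip install beautifulsoup4 requests lxml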
Before we start writing our code, note that while scraping public data is generally not illegal, we should avoid making hundreds of requests per second to a website since that may overload the site's server. In addition, it is always advisable to check the terms of the website that you intend to scrape to understand whether they allow scraping at all.
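One way to act on this advice is to read the site's robots.txt file and pause between requests. Below is a minimal sketch using only the Python standard library. The robots.txt content, the "my-scraper" user agent, the example.com URLs and the helper names are all made up for illustration; a real site's rules will differ, and robots.txt does not replace reading the site's terms.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, default=1.0):
    """Return the declared Crawl-delay in seconds, or a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def throttled(urls, pause):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index:
            time.sleep(pause)
        yield url


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("my-scraper", "https://example.com/news/")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/page")
pause = polite_delay(parser, "my-scraper")
print(allowed, blocked, pause)
```

In a real script, the loop that downloads pages could iterate over throttled(urls, pause) so that requests are spaced out.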
Creating our Python Scraper
Ok, let's start. We will scrape cryptonewsandprices.me which is a website containing a repository of Crypto news. Our goal is to extract the title and the date of publication from the site.
First of all, we should inspect the HTML code of the webpage to identify which elements we would like to extract from the site.
To view the page source of the site, right click and select "View Page Source". Then, we are able to see the HTML source code of the site that we will parse with Beautiful Soup. Looking at the HTML source, we can see that each title is wrapped in an h5 tag with the class "card-title". We will use these identifiers to scrape the information with the help of Beautiful Soup and its powerful parsers.
The first thing we need to do is import our libraries, requests and BeautifulSoup. Since we need to send a request to the page to be scraped, we use the requests library. Then, once we get a response from the site, we store it in a variable called "mainContent" that we will parse later:
import requests
from bs4 import BeautifulSoup

mainContent = requests.get("https://cryptonewsandprices.me/")
print(mainContent.text)
The problem is that the raw response we get back from requests.get is not very readable, and therefore we need to transform it into something easier to work with. Note that mainContent.text contains the whole HTML code of the page.
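The request step can also be made a little more defensive. The sketch below is not part of the original script: the helper names fetch_html and html_from_response and the 10 second timeout are my own assumptions. It uses raise_for_status so that an error page (for example a 404) is not parsed as if it were the news page, and a timeout so that a stalled server cannot hang the script.

```python
import requests


def html_from_response(response):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response.text


def fetch_html(url, timeout=10.0):
    """Download a page, giving up if the server does not answer in time."""
    response = requests.get(url, timeout=timeout)
    return html_from_response(response)


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("https://cryptonewsandprices.me/")
```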
Scraping information from one element
Let's now extract the news title. First we need to transform the string held in mainContent.text into a "soup" that Beautiful Soup can understand (and parse). It is possible to select different parsers to read the data. In this post, I use "lxml" since it is one of the most common and fastest parsers.
In the lines of code below, soup contains the HTML code of the whole page that we targeted with our get request. The lxml parser then lets us extract the desired information from the HTML source code.
Beautiful Soup provides a few methods to extract text from HTML tags, classes or other elements in the page. Since we know that each news title uses an h5 tag with the class card-title, we can use "find" to locate the first such element in the page and store its value in our title variable. In addition, we use "get_text()" to extract only the text inside the h5 tag, without the HTML mark-up.
soup = BeautifulSoup(mainContent.text,'lxml')
title = soup.find('h5', class_='card-title').get_text()
print(title)
Great, we have printed the title of the first news article on the page. Now, let's extract information about when this article was published. For that, we first need to look at the page source to understand which HTML element identifies the "Published Ago" information.
We can identify the element by the "small" tag and the "text-info" class. Once again, we can use the find method to locate and extract the object from our page.
published = soup.find('small', class_='text-info').get_text().strip()
print(published)
Perfect, now we have the published information and the title of the latest available article.
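One thing to keep in mind is that find returns None when nothing matches, so calling get_text() directly on the result fails with an AttributeError if the site changes its markup. Here is a small sketch of a guard against that. The HTML string is a made-up sample modelled on the tags and classes described above, first_text is a hypothetical helper, and I use Python's built-in "html.parser" so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the tags and classes described in this post.
SAMPLE_HTML = """
<div class="card">
  <h5 class="card-title">Sample headline </h5>
  <small class="text-info"> 2 hours ago </small>
</div>
"""


def first_text(soup, tag, css_class):
    """Return the stripped text of the first match, or None if none exists."""
    element = soup.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = first_text(soup, "h5", "card-title")
published = first_text(soup, "small", "text-info")
missing = first_text(soup, "h5", "no-such-class")
print(title, published, missing)
```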
Scraping information from multiple elements
It would be nice to get the titles and published information for all news articles instead of only one. For that, Beautiful Soup has a method called find_all. It works similarly to find, but returns every match instead of only the first one:
titleall = soup.find_all('h5', class_='card-title')
print(titleall)
# printed answer:
# [<h5 class="card-title">Ex-UFC Fighter & Bitcoin Bull Ben Askren: XRP is a Scam </h5>, <h5 class="card-title">Opporty founder calls SEC's ICO lawsuit 'grossly overstated' and 'untruthful' in an open letter </h5>,...]
What the find_all method returns is a list containing all news titles included in the page. Each element in the list is a title. However, the h5 tags are still part of the result.
The get_text method that we used before to extract the text does not work on a list. Therefore, in order to get each of the titles without the HTML tags, we loop through the list, apply get_text to each item, and append the result to a new list called title_list:
title_list = []
for item in titleall:
    individualtitle = item.get_text()
    title_list.append(individualtitle)
print(title_list)
Great, now we got back our titles without the html tags, only text.
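The same idea can be extended to the published information. Below is a short sketch that collects both lists with list comprehensions and pairs them up. The HTML is again a made-up sample, parsed with the built-in "html.parser". Note the assumption that every article has both a title and a published element in the same order, because zip simply pairs items by position.

```python
from bs4 import BeautifulSoup

# Made-up markup with two articles, mimicking the structure described above.
SAMPLE_HTML = """
<div class="card"><h5 class="card-title">First headline </h5>
  <small class="text-info">1 hour ago</small></div>
<div class="card"><h5 class="card-title">Second headline </h5>
  <small class="text-info">3 hours ago</small></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text(strip=True) removes the surrounding whitespace from each string.
titles = [tag.get_text(strip=True)
          for tag in soup.find_all("h5", class_="card-title")]
dates = [tag.get_text(strip=True)
         for tag in soup.find_all("small", class_="text-info")]

# Pair by position; this assumes both lists line up one to one.
articles = list(zip(titles, dates))
print(articles)
```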
Finalising our Python Web Scraper
As the last step, it would be interesting to write the extracted titles to a CSV file. For that, we can use the csv library and its writer:
import csv

# newline='' stops the csv module from adding blank rows on Windows
with open('pythonscraper.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for item in title_list:
        writer.writerow([item])
And just like this, we get a list of news titles in a CSV file. You can now try it on your own and extract any other information.
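As a possible next step, here is a sketch that writes two columns with a header row. The sample data, the write_articles helper and the file name pythonscraper_with_dates.csv are made up for illustration; in a real run the pairs would come from the scraped page.

```python
import csv


def write_articles(path, articles):
    """Write (title, published) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["title", "published"])
        writer.writerows(articles)


# Made-up sample data standing in for scraped results.
articles = [
    ("First headline", "1 hour ago"),
    ("Second headline", "3 hours ago"),
]
write_articles("pythonscraper_with_dates.csv", articles)
```

Passing encoding="utf-8" explicitly helps when titles contain non-ASCII characters, since the default encoding depends on the platform.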
Do not hesitate to write a comment if something is not clear, or watch the YouTube video where I go through the script line by line.
Source: https://towardsdatascience.com/web-scraping-with-python-made-easy-f069ffaf7754