Web Scraping Notes

Web scraping is the extraction (or harvesting) of data from websites and the storage of that data in various formats (e.g. database, JSON, spreadsheet, text files).

Popular Python Web Scraping Libraries

requests: lets you fetch web pages in their raw form
BeautifulSoup: a parsing library that can use different parsers (e.g. html.parser, lxml, xml)
lxml: a production-quality HTML and XML parsing library
selenium: automates browser control
scrapy: a complete crawling framework whose spiders can crawl through entire websites in a systematic way
mechanicalsoup: built around BeautifulSoup; it handles simple interactions such as checking a few boxes or entering some text, so you don’t have to build your own crawler for such tasks
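
The split between requests (fetching) and BeautifulSoup (parsing) can be sketched as below. This is a minimal illustration, not a complete scraper: the function names fetch_html and extract_links, the timeout value, and the sample markup are assumptions made for this example. The sample HTML is inline so the sketch runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw HTML text (the job of requests)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Parse HTML and return (text, href) pairs (the job of BeautifulSoup)."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


# Hypothetical sample markup, used instead of a live page.
SAMPLE_HTML = """
<html><body>
  <a href="/coins/bitcoin">Bitcoin</a>
  <a href="/coins/ethereum">Ethereum</a>
  <a>no href, skipped</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
print(links)
```

Against a real site, the two steps would be chained as extract_links(fetch_html(url)), where url is whatever page you are allowed to scrape.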

Others:

feedparser: helpful if you are working with XML feeds such as Atom or RSS
lassie: retrieves basic content such as a description, title, keywords, or a list of images from a webpage
robobrowser: browses the web without a standalone web browser; it can fetch a page, click on links and buttons, and fill out and submit forms

Popular data scraping tools

Open-Source Tools

BeautifulSoup transforms a complex HTML document into a complex tree of Python objects. You’ll deal with only four kinds of objects.

Tag: corresponds to an XML or HTML tag in the original document. Important topics to explore: (1) navigating the tree, (2) searching the tree.
NavigableString: a string corresponding to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text.
BeautifulSoup: The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
Comment: Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment. The Comment object is just a special type of NavigableString.
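
The four kinds of objects, along with basic navigating and searching, can be seen in a short sketch. The HTML snippet and variable names below are made up for illustration, and the built-in html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample document, invented for this example.
HTML = """
<html><body>
<!-- generated for demo -->
<h1 id="title">Winning numbers</h1>
<ul class="draw"><li>04</li><li>17</li><li>32</li></ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")   # BeautifulSoup: the whole document

heading = soup.h1                            # Tag: the <h1> element
text = heading.string                        # NavigableString: text inside the tag
# Comment: found by searching strings for instances of the Comment class
comment = soup.find(string=lambda s: isinstance(s, Comment))

kinds = [type(obj).__name__ for obj in (soup, heading, text, comment)]
print(kinds)

# Navigating the tree: move between related nodes.
parent_name = heading.parent.name            # tag that encloses <h1>
second = soup.ul.li.next_sibling.string      # text of the <li> after the first one

# Searching the tree: find nodes by name or attribute.
numbers = [li.get_text() for li in soup.find_all("li")]
by_id = soup.find(id="title")

print(parent_name, second, numbers, by_id.get_text())
```

Note that Comment is a subclass of NavigableString and BeautifulSoup is a subclass of Tag, which is why the document object supports the same navigation and search methods as any tag.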

Cryptocurrency price scraping (BeautifulSoup)
Lotto winning numbers scraping (BeautifulSoup)
Map automation download (BeautifulSoup)
Website list scraping (BeautifulSoup)
- Medium list of top publishers
- cryptocurrency capitalization list
Filling Google sign-up form (selenium)
arXiv paper download (feedparser)
Tripadvisor Hotel Reviews (Philippines Area) (scrapy)