Welcome to scrapex's documentation!
Scrapex is a simple web scraping framework built on top of requests and lxml. It supports both Python 2 and Python 3.
>>> from scrapex import Scraper
>>> s = Scraper(use_cache = True)
>>> doc = s.load('https://github.com/search?q=scraping')
>>>
>>> print(doc.extract("//h3[contains(text(),'results')]").strip())
59,256 repository results
>>>
>>> listings = doc.query("//ul[@class='repo-list']/li")
>>> print('number of listings on first page:', len(listings))
number of listings on first page: 10
>>>
>>> for listing in listings[0:3]:
...     print('repo name:', listing.extract(".//div[contains(@class,'text-normal')]/a"))
...
repo name: scrapinghub/portia
repo name: scrapy/scrapy
repo name: REMitchell/python-scraping
>>>
Key Features
- Easy to extract data points from HTML documents using XPath, regular expressions, or string subtraction
- Easy to parse a street address into its components (street, city, state, zip code)
- Easy to parse a person's name into its components (prefix, first name, middle name, last name, suffix)
- Easy to save results to CSV, Excel, or JSON files
- HTML cache:
- quickly recover from parsing mistakes or changes without having to send HTTP requests again.
- easily resume a broken scrape without having to re-download HTML files that were already fetched.
- Random proxy rotation
- Random user agent rotation
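Conceptually, per-request proxy and user-agent rotation amounts to drawing a fresh entry from a configured pool before each fetch. A minimal sketch of that idea in plain Python (the pool contents and the `random_request_settings` helper are illustrative, not scrapex's actual API):

```python
import random

# Illustrative pools -- in practice these would come from your own
# configuration; the values below are placeholders, not scrapex defaults.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def random_request_settings():
    """Pick a fresh proxy and User-Agent for each request.

    Returns a (proxies, headers) pair shaped like the arguments
    the requests library accepts.
    """
    proxy = random.choice(PROXIES)
    proxies = {"http": proxy, "https": proxy}
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return proxies, headers

proxies, headers = random_request_settings()
```

Rotating both values per request makes traffic look less uniform to the target site, which is the point of the two features above.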