Welcome to scrapex’s documentation!

Scrapex is a simple web scraping framework. Built on top of requests and lxml, it supports both Python 2 and Python 3.

>>> from scrapex import Scraper
>>> s = Scraper(use_cache=True)
>>> doc = s.load('https://github.com/search?q=scraping')
>>>
>>> print(doc.extract("//h3[contains(text(),'results')]").strip())
59,256 repository results
>>>
>>> listings = doc.query("//ul[@class='repo-list']/li")
>>> print('number of listings on first page:', len(listings))
number of listings on first page: 10
>>>
>>> for listing in listings[0:3]:
...     print('repo name: ', listing.extract(".//div[contains(@class,'text-normal')]/a"))
...
repo name:  scrapinghub/portia
repo name:  scrapy/scrapy
repo name:  REMitchell/python-scraping
>>>

Key Features

  • Easy to parse data points from HTML documents using XPath, regular expressions, or string subtraction
  • Easy to parse a street address into components (street, city, state, zip code)
  • Easy to parse a person's name into components (prefix, first name, middle name, last name, suffix)
  • Easy to save results to CSV, Excel, or JSON files
  • HTML cache:
    • quickly recover from any parsing mistakes/changes without having to send HTTP requests again.
    • easy to resume a broken scrape without re-downloading HTML files that were already fetched.
  • Random proxy rotation
  • Random user agent rotation
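To give a feel for the address-parsing feature, here is a minimal standard-library sketch of splitting a US-style street address into components. This is an illustration of the idea only, not scrapex's actual implementation or API; the function name and the assumed "street, city, ST zip" layout are both hypothetical:

```python
import re

def parse_address(address):
    """Split a simple US-style address into street/city/state/zip.

    Illustrative sketch only (not scrapex's implementation): assumes
    the common "street, city, ST zip" comma-separated layout.
    """
    m = re.match(
        r'^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*'
        r'(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$',
        address.strip(),
    )
    return m.groupdict() if m else None

parts = parse_address('500 W Madison St, Chicago, IL 60661')
# parts == {'street': '500 W Madison St', 'city': 'Chicago',
#           'state': 'IL', 'zip': '60661'}
```

Addresses that do not fit the assumed layout return ``None``, so callers can fall back to keeping the raw string.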

Indices and tables