Welcome to scrapex's documentation!
Scrapex is a simple web scraping framework built on top of requests and lxml. It supports both Python 2 and Python 3.
>>> from scrapex import Scraper
>>> s = Scraper(use_cache = True)
>>> doc = s.load('https://github.com/search?q=scraping')
>>>
>>> print(doc.extract("//h3[contains(text(),'results')]").strip())
59,256 repository results
>>>
>>> listings = doc.query("//ul[@class='repo-list']/li")
>>> print('number of listings on first page:', len(listings))
number of listings on first page: 10
>>>
>>> for listing in listings[0:3]:
...     print('repo name:', listing.extract(".//div[contains(@class,'text-normal')]/a"))
...
repo name: scrapinghub/portia
repo name: scrapy/scrapy
repo name: REMitchell/python-scraping
>>>
Key Features
- Easy to extract data points from HTML documents using XPath, regular expressions, or string subtraction
- Easy to parse a street address into its components (street, city, state, zip code)
- Easy to parse a person's name into its components (prefix, first name, middle name, last name, suffix)
- Easy to save results to CSV, Excel, or JSON files
- HTML cache:
- quickly recover from parsing mistakes or changes without having to send HTTP requests again.
- easily resume a broken scrape without having to re-download HTML files that were already fetched.
- Random proxy rotation
- Random user agent rotation
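Conceptually, per-request proxy and user-agent rotation amounts to drawing a fresh entry from a configured pool before each fetch. A minimal sketch of that idea in plain Python (the pool contents and the `random_request_settings` helper are illustrative, not scrapex's actual API):

```python
import random

# Illustrative pools -- in practice these would come from your own
# configuration; the values below are placeholders, not scrapex defaults.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def random_request_settings():
    """Pick a fresh proxy and User-Agent for each request.

    Returns a (proxies, headers) pair shaped like the arguments
    the requests library accepts.
    """
    proxy = random.choice(PROXIES)
    proxies = {"http": proxy, "https": proxy}
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return proxies, headers

proxies, headers = random_request_settings()
```

Rotating both values per request makes traffic look less uniform to the target site, which is the point of the two features above.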