A personal shortlist of well-working Python-based scrapers in 2020.

Instagram

instagram-scraper is currently the best Python-based web scraper out there. It reliably scrapes pictures, videos and their metadata. Below is a simple example that scrapes all posts from an Instagram location, identified by its so-called location ID. Add "--media-types none" to retrieve only the metadata. Unfortunately, this does not make the scraper any faster.

instagram-scraper --include-location --comments --media-metadata --location 252823277
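
For scripted runs, the same call can be assembled in Python. This is only a minimal sketch: the helper name build_instagram_command is hypothetical, and it uses nothing beyond the flags shown above plus "--media-types none".

```python
import shlex


def build_instagram_command(location_id, metadata_only=False):
    """Assemble the instagram-scraper call from above as an argument list.

    Hypothetical helper for illustration; it only uses flags mentioned in this post.
    """
    cmd = [
        "instagram-scraper",
        "--include-location",
        "--comments",
        "--media-metadata",
        "--location", str(location_id),
    ]
    if metadata_only:
        # Skip downloading pictures and videos, keep only the metadata
        cmd += ["--media-types", "none"]
    return cmd


# Print the command line for the location used above, metadata only
print(shlex.join(build_instagram_command(252823277, metadata_only=True)))
```

Assuming instagram-scraper is installed and on the PATH, the returned list could be passed to subprocess.run(cmd, check=True) to start the scrape.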

Flickr

flickr-scrape works with an official Flickr API key. It must be run from the directory that contains scraper.py. Optionally, you can work with bounding boxes.

python scraper.py --search "Bahnhof Zoo Berlin"

Twitter

twitter-scraper is backed by a fairly active community and is as straightforward to use as the previous scrapers.

twitterscraper "from:TrumpBabyUK" -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json

Data processing in pandas

Python loves data. All of the scrapers above export their data as *.json files, which can therefore be read into pandas immediately.

import pandas as pd
df = pd.read_json("data.json")
df
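
To see what read_json does with such a file, here is a small self-contained sketch. The records and their field names (username, text, likes) are made up for illustration and do not reflect the actual output of any of the scrapers.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up records standing in for scraper output (field names are assumptions)
records = [
    {"username": "alice", "text": "first post", "likes": 3},
    {"username": "bob", "text": "second post", "likes": 7},
]

with tempfile.TemporaryDirectory() as tmpdir:
    # Write the records to a temporary *.json file, as a scraper would
    path = os.path.join(tmpdir, "data.json")
    with open(path, "w", encoding="utf8") as f:
        json.dump(records, f)
    # A top-level list of objects becomes one row per object
    df = pd.read_json(path)

print(df.shape)
print(df["likes"].sum())
```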

Only instagram-scraper wraps its JSON output in an unnecessary extra parent node. This is easily dealt with by the following lines.

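import json
import pandas as pd

infile = "data.json"
with open(infile, encoding="utf8") as f:
    d = json.load(f)
# instagram-scraper nests all posts under the "GraphImages" key.
# pd.json_normalize needs pandas 1.0+; older versions use pandas.io.json.json_normalize
df = pd.json_normalize(d["GraphImages"])
df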
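
The effect of unwrapping the parent node can be tried without a scraped file. The following sketch uses a made-up dictionary: only the "GraphImages" key comes from the snippet above, while the inner fields (id, owner, likes) are assumptions chosen for illustration. It also assumes pandas 1.0 or later for pd.json_normalize.

```python
import pandas as pd

# Made-up structure imitating the extra parent node (inner field names are assumptions)
d = {
    "GraphImages": [
        {"id": "1", "owner": {"id": "42"}, "likes": 5},
        {"id": "2", "owner": {"id": "43"}, "likes": 9},
    ]
}

# Passing the list under the parent key yields one row per post;
# nested dicts are flattened into dotted column names such as "owner.id"
df = pd.json_normalize(d["GraphImages"])
print(sorted(df.columns))
```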