How to start a background thread in Python to download images from an API.

Why background threads?

When scraping data, or simply querying an API such as the one Instagram currently uses, you often want to request additional files while the main data is still being retrieved. Too abstract? Let’s make it concrete.

A real world example

Fast Instagram Scraper retrieves the small data batches that Instagram delivers. Usually, a request for a Facebook location (location ID) returns a JSON document with around 50 posts; for hashtags it can be up to 150 posts per request. Let’s assume you want the metadata and pictures of the latest 500 posts for the location ID 123456789.

When you run Fast Instagram Scraper, the first batch should arrive after a few seconds. Depending on the pause you set between requests, say 10 seconds, each further batch of 50 posts follows at that interval. So after 40 seconds you should have mined around 200 posts or, more accurately, the metadata of 200 posts.

Seconds 10  20  30  40 ...
Batch    1   2   3   4 ...
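
The timeline above can be turned into a quick estimate. The sketch below assumes a fixed batch size of 50 posts and a 10 second pause, as in the example; `estimate_seconds` is a hypothetical helper and not part of Fast Instagram Scraper.

```python
import math

def estimate_seconds(total_posts, batch_size=50, pause_seconds=10):
    """Rough time until total_posts posts are mined, one batch per pause."""
    batches = math.ceil(total_posts / batch_size)
    return batches, batches * pause_seconds

batches, seconds = estimate_seconds(500)
print(batches, seconds)  # 10 100
```

Under these assumptions, the 500 posts from the example take about 10 batches, or roughly 100 seconds for the metadata alone.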

While the main script is busy requesting a batch, pausing for a few seconds, requesting the next batch and so on, we want to start a background thread every time a batch is retrieved. That thread requests the picture URLs and downloads the pictures asynchronously.
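
As a rough illustration of this pattern, here is a minimal sketch. `fetch_batch` and `download_pictures` are hypothetical stand-ins for the real API request and the real picture download, and the short sleeps only simulate waiting; none of these names come from Fast Instagram Scraper.

```python
import threading
import time

def fetch_batch(batch_no):
    # Hypothetical stand-in for one API request; returns fake post records.
    return [{"shortcode": "post{}_{}".format(batch_no, i)} for i in range(3)]

def download_pictures(batch, done, lock):
    # Hypothetical stand-in for the slow picture download.
    time.sleep(0.2)
    names = [post["shortcode"] for post in batch]
    with lock:  # protect the shared result list
        done.extend(names)

def mine(num_batches, pause=0.05):
    done, threads, lock = [], [], threading.Lock()
    for batch_no in range(1, num_batches + 1):
        batch = fetch_batch(batch_no)
        # Hand the batch to a background thread and move on immediately.
        t = threading.Thread(target=download_pictures, args=(batch, done, lock))
        t.start()
        threads.append(t)
        time.sleep(pause)  # the break between two requests
    for t in threads:
        t.join()  # wait for outstanding downloads before returning
    return done

downloaded = mine(num_batches=4)
print(len(downloaded))  # 12
```

The main loop never waits for a download; it only waits once at the very end so that no picture is lost when the script exits.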

Asynchronous function

Python threading

Asynchronous execution is easily implemented with Python’s threading module. To start a function alongside the main script, a one-liner is sufficient.

import threading
threading.Thread(target=func, name=name, args=args).start()

Wrapped in a function it looks slightly nicer.

import threading

def start_thread(func, name=None, args=()):
    # Run func on a side thread; a tuple default avoids a shared mutable default argument
    threading.Thread(target=func, name=name, args=args).start()
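
One possible variation, not taken from Fast Instagram Scraper: if the caller later needs to wait for the side thread, the wrapper can hand back the `Thread` object. The name `start_thread_joinable` and the `daemon` parameter are assumptions made for this sketch.

```python
import threading

def start_thread_joinable(func, name=None, args=(), daemon=False):
    # Same idea as a fire-and-forget wrapper, but returns the Thread object.
    thread = threading.Thread(target=func, name=name, args=args, daemon=daemon)
    thread.start()
    return thread

results = []
t = start_thread_joinable(results.append, name="worker", args=("done",))
t.join()  # block until the side thread has finished
print(t.name, results)  # worker ['done']
```

Returning the thread costs nothing when the caller ignores it, and it keeps the option open to call `join()` before the program exits.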

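If we run the following script, it becomes clear that there are two threads running: the main thread and a side thread. The main thread prints both of its messages first, because the side thread sleeps for a second before printing.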

import time
def print_hi():
  time.sleep(1)
  print("hi")

print("Start main thread")
start_thread(print_hi)
print("End main thread")

# Output:
# Start main thread
# End main thread
# hi
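
To make the two threads visible rather than inferring them from the print order, the standard library's `threading.enumerate()` lists all threads that are currently alive. The following is a self-contained sketch; the thread name `side-thread` is chosen only for illustration.

```python
import threading
import time

def print_hi():
    time.sleep(0.5)
    print("hi")

side = threading.Thread(target=print_hi, name="side-thread")
side.start()

# While print_hi is still sleeping, the side thread shows up next to the main thread.
names = [t.name for t in threading.enumerate()]
print(names)

side.join()  # wait for "hi" before the script ends
alive_after = [t.name for t in threading.enumerate()]
print(alive_after)  # the side thread is gone once it has finished
```

When run as a plain script, the first list should contain the main thread and `side-thread`, and the second list should no longer contain `side-thread`.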

Download an image over Tor on a background thread

A more complex, working example could look like this: it downloads pictures asynchronously over Tor, with a timeout, from a dictionary that maps filenames to URLs. These functions are used in Fast Instagram Scraper.

from torpy.http.requests import TorRequests
import shutil
from tqdm import tqdm
from func_timeout import func_timeout, FunctionTimedOut
import pandas as pd 
import json
import threading

# read example file 
infile = "path/to/sample.json"
with open(infile, encoding="utf8") as f:
    d = json.load(f)
df = pd.json_normalize(d["edge_location_to_media"]["edges"])

# define request header
headers = {}
headers['User-agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"

# create sample dict
img_name = df["node.shortcode"][:5] # name
img_link = df["node.display_url"][:5] # link
img_dict = dict(zip(img_name,img_link))

def start_thread(func, name=None, args=()):  # tuple default instead of a mutable list
    threading.Thread(target=func, name=name, args=args).start()

def download_img(session, img_name, img_link):
    try:
        img_req = session.get(img_link, headers=headers, stream=True)  # fire request
        # a .png file is always written; if the request did not succeed (status other than 200) it will be only about 1 kB
        with open('{}.png'.format(img_name), 'wb') as out_file:
            shutil.copyfileobj(img_req.raw, out_file)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Couldn't download image {} with link {}.".format(img_name, img_link))
        return "download_failed"

def tor_img_download_loop(img_dict): 
    with TorRequests() as tor_requests:
        with tor_requests.get_session() as sess:
            print("Image download tor circuit built.")
            for i, (k, v) in tqdm(enumerate(img_dict.items())):
                download_img(sess,k,v)
                # remaining_keys = list(img_dict.keys())[i:]
                # remaining_dict = {k: img_dict[k] for k in remaining_keys}

def download_images(img_dict):
    try:
        func_timeout(img_tor_timeout, tor_img_download_loop, args=[img_dict])
    except FunctionTimedOut:
        print("Tor session terminated after {} seconds (img_tor_timeout).".format(img_tor_timeout))
        return
        
img_tor_timeout = 300

start_thread(download_images, args =[img_dict])
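
For readers who want to try the timeout idea without installing anything, here is a standard-library-only sketch. It swaps in a different technique (a daemon thread plus `join` with a timeout) and is not how Fast Instagram Scraper does it; `run_with_timeout` and `slow_download_loop` are hypothetical names. Note that this sketch only stops waiting for the worker and does not attempt to stop it.

```python
import threading
import time

def run_with_timeout(func, timeout, args=()):
    """Run func in a daemon thread; return True if it finished within timeout seconds."""
    worker = threading.Thread(target=func, args=args, daemon=True)
    worker.start()
    worker.join(timeout)          # wait at most timeout seconds
    return not worker.is_alive()  # still alive means the time limit was hit

def slow_download_loop(seconds):
    # Hypothetical stand-in for a long-running download loop.
    time.sleep(seconds)

print(run_with_timeout(slow_download_loop, timeout=0.2, args=(0.05,)))  # True
print(run_with_timeout(slow_download_loop, timeout=0.2, args=(1.0,)))   # False
```

Because the worker is a daemon thread, a download that is still running will not keep the interpreter alive once the main script ends.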

More

Like this post? See the repo and find a Jupyter notebook with more interesting functions.