Photo by Polina Razorilova on Unsplash

Using Python and Pandas to quickly process social media data mined with instagram-scraper. Get the most common hashtags!

Instagram-Scraper

As mentioned in an earlier post, instagram-scraper is currently the best Python-based Instagram data scraper around. It reliably scrapes pictures, videos and their metadata. Here is a simple example that scrapes all posts from an Instagram location (identified by a so-called “location ID”). Adding “--media-types none” retrieves only the metadata, but doesn’t run any faster.

instagram-scraper --include-location --comments --media-metadata --location 252823277
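
If you only need the metadata, the same call with “--media-types none” looks like this:

instagram-scraper --include-location --comments --media-metadata --media-types none --location 252823277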

Update October 2020

Due to Instagram’s policy change, you now need to log in with -u username -p password. Otherwise the scraper won’t work, since unauthenticated users can’t view posts anymore.
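
Combined with the call from above (username and password being placeholders), that gives:

instagram-scraper -u your_username -p your_password --include-location --comments --media-metadata --location 252823277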

Output

By default, the scraper saves the metadata in a JSON file that looks like this:

{
    "GraphImages": [
        {
            "dimensions": {
                "height": 1350,
                "width": 1080
            },
            "display_url": "some_instagram_cdn_url",
            "edge_liked_by": {
                "count": 124
            },
            "edge_media_preview_like": {
                "count": 124
            },
            "edge_media_to_caption": {
                "edges": [
                    {
                        "node": {
                            "text": "There is a reason why I´m not on Instagram!"
                        }
                    }
                ]
            },
		...
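
To peek at this structure in Python before doing anything fancy, a minimal sketch (assuming the default folder layout described in the next section):

import json

# load the scraper output for the location scraped above
with open("252823277/252823277.json", encoding="utf8") as f:
    data = json.load(f)

print(len(data["GraphImages"]))                          # number of scraped posts
print(data["GraphImages"][0]["edge_liked_by"]["count"])  # likes of the first post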

Directory tree

In this use case, let’s assume we scraped the metadata of 10 different locations. By default, instagram-scraper saves the metadata in a folder named after the location ID. In every folder you’ll find one file, also named after the location and with a .json ending, leading to this tree:

project
├── 12345
│   └── 12345.json
├── 12346
│   └── 12346.json
├── 12347
│   └── 12347.json
...

Get most common hashtags

Now we are interested in the occurrence of hashtags. In particular, we want to create a list of the most commonly used hashtags for every location ID, while balancing the posts by counting each unique hashtag only once per post. If someone expresses an overwhelming love for Berlin by using the same hashtag three times, “#berlin #berlin #berlin” counts only once as “#berlin”.
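
As a quick standalone illustration of this per-post deduplication (the same set trick the script below uses):

caption = "I love this city! #berlin #berlin #berlin #sunset"

# a set keeps each unique hashtag only once per post
unique_tags = {part[1:] for part in caption.split() if part.startswith("#")}
print(unique_tags)  # {'berlin', 'sunset'} (set order may vary)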

Hands-on in Python!

The following script will create an Excel file for every JSON file, containing the most common hashtags of that file.

Jupyter Notebook progress bar

This script was written for Jupyter Notebooks. If you are using Spyder or bash, change the first line to get a simple plain-text progress bar.

from tqdm import tqdm

This script will get the job done in no time! Even though we commit a crime by using pandas’ iterrows(), it works reliably and fast: my personal benchmark on an i7, with ~8000 posts split across 10 different files, finishes in less than 4 seconds, so the iterrows() part is not really worth reworking (a faster alternative is sketched at the very end anyway).

from tqdm.notebook import tqdm
import json 
import pandas as pd
from collections import Counter
from datetime import datetime
import glob 
import os

# if necessary, change dir
# os.chdir("Instagram/Hashtag-Analysis")

# get all json files (also inside the location subfolders)
alljsons = glob.glob("**/*.json", recursive=True)

# get hashtags from text
def extract_hash_tags(s):
    return list(set(part[1:] for part in s.split() if part.startswith('#')))

# node handling: captions can be missing or empty
def indexpos(a):
    try:
        return extract_hash_tags(a[0]["node"]["text"])
    except (TypeError, KeyError, IndexError):
        return None

# json normalization
def normalize_json(infile):
    with open(infile, encoding="utf8") as f:
        d = json.load(f)
    # json_normalize moved to the pandas top level in pandas 1.0
    wf = pd.json_normalize(d['GraphImages'])
    return wf

# loop through all files 
for i in tqdm(alljsons):
    nf = normalize_json(i)
    maxt = max(nf.taken_at_timestamp)
    mint = min(nf.taken_at_timestamp)
    maxt_str = str(datetime.fromtimestamp(maxt))
    mint_str = str(datetime.fromtimestamp(mint))
    
    # print some stats
    templ = "min: {} | max: {} | count: {} | id: {}"
    locid = os.path.splitext(os.path.basename(i))[0]
    print(templ.format(mint_str, maxt_str, str(len(nf)), locid))

    # add a column with hashtags extracted from the caption text
    nf["text_hashtags"] = nf["edge_media_to_caption.edges"].apply(indexpos)

    # get list with all hashtags
    alltags = []
    
    # loop through all rows
    # note that iterrows() is bad pandas usage!
    for index, row in nf.iterrows():
        temptags = []
        
        # if cell is a list, append
        if isinstance(row.tags, list):
            temptags = temptags + row.tags
        if isinstance(row["text_hashtags"], list):
            temptags = temptags + row["text_hashtags"]
        
        # important: use "set" to get every unique hashtag once only per post/row
        alltags = alltags + list(set(temptags))

    # lower for later comparison
    alltagslower = [ht.lower() for ht in alltags]
    
    # export pandas dataframe with tag and count
    mf = pd.DataFrame(Counter(alltagslower).most_common(), columns=["tag", "count"])
    outname = locid + ".xlsx"
    mf.to_excel(outname, index=False)

For every file, it will print the min and max date, the post count and the location ID like this:

min: 2015-08-07 15:39:05 | max: 2020-05-22 22:03:14 | count: 1110 | id: 123456
min: 2017-09-29 19:12:21 | max: 2020-04-07 23:50:54 | count: 1331 | id: 123457
...

If you’d like combined stats across all files, just add the following lines:

# right before the main loop
all_files_hashtags = []

# inside the loop, right after alltagslower is built
    all_files_hashtags = all_files_hashtags + alltagslower

# at the very end, after the loop
af = pd.DataFrame(Counter(all_files_hashtags).most_common(), columns=["tag", "count"])
outname = "all_files_hashtags.xlsx"
af.to_excel(outname, index=False)
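
And the promised aside: a minimal sketch of a faster alternative to the iterrows() loop, assuming the same nf dataframe and column names (tags, text_hashtags) as above:

def combine_row_tags(tags, text_tags):
    # merge both hashtag sources of one post, deduplicated via set
    combined = []
    if isinstance(tags, list):
        combined += tags
    if isinstance(text_tags, list):
        combined += text_tags
    return list(set(combined))

# zip the two columns instead of iterating over full rows
per_post = [combine_row_tags(t, tt) for t, tt in zip(nf["tags"], nf["text_hashtags"])]
alltags = [tag for post in per_post for tag in post]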