Photo by Polina Razorilova on Unsplash
Using Python and Pandas to quickly process social media data mined with instagram-scraper and extract the most common hashtags.
Instagram-Scraper
As mentioned in an earlier post, instagram-scraper is currently the best Python-based Instagram data scraper out there.
It reliably scrapes pictures, videos and their metadata.
Here is a simple example for scraping all posts from an Instagram location (identified by a so-called "location ID"). Adding --media-types none retrieves only the metadata, but it doesn't run any faster.
instagram-scraper --include-location --comments --media-metadata --location 252823277
Update October 2020
Due to Instagram's policy change you now need to log in with -u username -p password. Otherwise the scraper won't work, since unauthenticated users can no longer view posts.
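For example, the location scrape from above becomes (username and password are placeholders, of course):

instagram-scraper -u your_username -p your_password --include-location --comments --media-metadata --location 252823277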
Output
By default, the scraper saves the metadata in a JSON file that looks like this:
{
    "GraphImages": [
        {
            "dimensions": {
                "height": 1350,
                "width": 1080
            },
            "display_url": "some_instagram_cdn_url",
            "edge_liked_by": {
                "count": 124
            },
            "edge_media_preview_like": {
                "count": 124
            },
            "edge_media_to_caption": {
                "edges": [
                    {
                        "node": {
                            "text": "There is a reason why I'm not on Instagram!"
                        }
                    }
                ]
            },
            ...
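To get a first feel for this structure, you can flatten one of these files into a dataframe and look at the columns; a minimal sketch (the file name is just an example):

import json
import pandas as pd

# load one scraped file and flatten the nested GraphImages records
with open("252823277.json", encoding="utf8") as f:
    data = json.load(f)

df = pd.json_normalize(data["GraphImages"])
print(len(df), "posts")
print(df.columns.tolist())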
Directory tree
In this use case, let's assume we scraped the metadata of 10 different locations. By default, instagram-scraper saves the metadata in a folder named after the respective location ID. In every folder you'll find one file, also named after the location, with a .json extension, leading to this tree:
project
|
└───12345
│   └── 12345.json
|
└───12346
│   └── 12346.json
|
└───12347
│   └── 12347.json
|
...
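If you want to collect these files directly from the project root, a nested glob pattern does the trick; a quick sketch (the script further down globs *.json in the current working directory instead, so pick whichever matches where you run it from):

import glob

# one pattern level per subfolder: matches project/<location_id>/<location_id>.json
alljsons = glob.glob("*/*.json")
print(alljsons)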
Now, we are interested in the occurrence of hashtags. In particular, we want to create a list of the most commonly used hashtags for every location ID and, to keep the posts balanced, count each unique hashtag only once per post.
If someone wants to express their overwhelming love for Berlin by using the same hashtag three times, "#berlin #berlin #berlin", it still counts only once as "#berlin".
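That is exactly what the extract_hash_tags helper in the script below does with a set; a tiny demonstration:

def extract_hash_tags(s):
    # set() keeps every hashtag only once per caption
    return list(set(part[1:] for part in s.split() if part.startswith('#')))

print(extract_hash_tags("Love this city! #berlin #berlin #berlin"))
# -> ['berlin']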
Hands-on in Python!
The following script creates an Excel file for every JSON file, listing the most common hashtags per file.
Jupyter Notebook progress bar
This script was written for Jupyter notebooks. If you are running it in Spyder or from the shell, change the first import to get a plain-text progress bar instead.
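That is, swap the notebook import for the standard one:

# Jupyter notebook version (first line of the script below)
from tqdm.notebook import tqdm

# plain-text version for Spyder / terminal use
from tqdm import tqdm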
This script will get the job done in no time! Even though we commit a crime by using pandas' iterrows(), it works reliably and fast. My personal benchmark on an i7, ~8,000 posts split across 10 files, finishes in less than 4 seconds, so reworking the iterrows() part isn't really worth it.
from tqdm.notebook import tqdm
import json
import pandas as pd
from pandas import json_normalize  # moved to the top-level namespace in pandas 1.0
from collections import Counter
from datetime import datetime
import glob
import os

# if necessary, change dir
# os.chdir("Instagram/Hashtag-Analysis")

# get all json files in cwd
alljsons = glob.glob("*.json")

# get hashtags from text
def extract_hash_tags(s):
    return list(set(part[1:] for part in s.split() if part.startswith('#')))

# node handling: pull the caption text out of the nested edge structure
def indexpos(a):
    try:
        return extract_hash_tags(a[0]["node"]["text"])
    except (IndexError, KeyError, TypeError):
        return None

# json normalization: flatten the GraphImages records into a dataframe
def normalize_json(infile):
    with open(infile, encoding="utf8") as f:
        d = json.load(f)
    wf = json_normalize(d['GraphImages'])
    return wf

# loop through all files
for i in tqdm(alljsons):
    nf = normalize_json(i)
    maxt = max(nf.taken_at_timestamp)
    mint = min(nf.taken_at_timestamp)
    maxt_str = str(datetime.fromtimestamp(maxt))
    mint_str = str(datetime.fromtimestamp(mint))
    # print some stats
    templ = "min: {} | max: {} | count: {} | id: {}"
    print(templ.format(mint_str, maxt_str, str(len(nf)), i.split(".")[0]))
    # add column with hashtags extracted from the caption text
    nf["text_hashtags"] = nf["edge_media_to_caption.edges"].apply(lambda x: indexpos(x))
    # collect all hashtags of this file
    alltags = []
    # loop through all rows
    # note that iterrows() is bad pandas usage!
    for index, row in nf.iterrows():
        temptags = []
        # if cell is a list, append
        if isinstance(row.tags, list):
            temptags = temptags + row.tags
        if isinstance(row["text_hashtags"], list):
            temptags = temptags + row["text_hashtags"]
        # important: use "set" to count every unique hashtag only once per post/row
        alltags = alltags + list(set(temptags))
    # lowercase for later comparison
    alltagslower = [ht.lower() for ht in alltags]
    # export pandas dataframe with tag and count
    mf = pd.DataFrame(Counter(alltagslower).most_common(), columns=["tag", "count"])
    outname = i.split(".")[0] + ".xlsx"
    mf.to_excel(outname, index=False)
It will print the min and max dates, the post count and the location ID like this:
min: 2015-08-07 15:39:05 | max: 2020-05-22 22:03:14 | count: 1110 | id: 123456
min: 2017-09-29 19:12:21 | max: 2020-04-07 23:50:54 | count: 1331 | id: 123457
...
If you'd also like aggregated stats across all files, just add the following lines:
# before the loop over all files
all_files_hashtags = []

# inside the loop, right after alltagslower is built
all_files_hashtags = all_files_hashtags + alltagslower

# after the loop (very end)
af = pd.DataFrame(Counter(all_files_hashtags).most_common(), columns=["tag", "count"])
outname = "all_files_hashtags.xlsx"
af.to_excel(outname, index=False)
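For a quick look at the aggregated result (assuming openpyxl is installed for the .xlsx handling), something like this will do:

import pandas as pd

# top 10 hashtags across all scraped locations
af = pd.read_excel("all_files_hashtags.xlsx")
print(af.head(10))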