ResagratiaResagratiaResagratia
+234 806 299 1199
contact@resagratia.com

Visualizing WhatsApp Chats using Python and Power BI

  • Home
  • Business
  • Visualizing WhatsApp Chats using Python and Power BI
Resagratia WhatsApp Analytics

INTRODUCTION

More than 1 billion people in the world in over 180 countries make use of Whatsapp to stay in touch with friends and family and carry out business and other social interactions.  It allows users share messages in the form of text, pictures, videos and voice notes and is thus a great source of data for analysis.

In this tutorial we will go through exporting data from the Whatsapp platform, importing it into Python, transforming and cleaning this data and finally creating an awesome interactive dashboard using Power BI.

PREREQUISITES

Before we get started you’ll need to have the following packages installed in your Python environment:

  1. pandas
  2. matplotlib
  3. regex
  4. Dateparser
  5. numpy

You can easily install these packages by running the  pip install command in your Jupyter notebook or on the Command line. You can also use conda install  in Jupyter or Anaconda prompt if you are working on an Anaconda distribution of Python. For example, to install pandas you run:

pip install pandas

or

conda install pandas

Once you have all these packages installed we can then delve into Data Collection.

Learn Data Analysis with Excel

EXTRACTING WHATSAPP CHAT DATA

For this tutorial, I’ll be working with data from Resagratia Whatsapp Group Chat but you can make use of your own data. Simply open the Whatsapp app on your mobile, navigate to the chat you want to analyze (I’ll suggest a group chat for its wide range of visualization options), click on the vertical ellipsis (the three small dots beside the phone symbol), click on More, select Export chat and choose WITHOUT MEDIA. Including media takes ages to export and we don’t want that. Then select the Email option and export.

Download the exported chat data from your email and save it as Chat.txt in a new folder specially created for this project. The chat data is in text file format and we will use Python to clean and transform this data.

DATA CLEANING AND PREPARATION

Our chat data is in text file format and we can’t do much with it. We’ll need to clean and transform this text file into a dataframe to enable us analyze it, derive insights and create awesome visuals. We’ll make use of the pandas library in Python to do this.

Open up Jupyter notebook (or your preferred IDE), navigate to the folder which contains our text file, and create a new notebook.

First, we need to import some packages we need for our analysis:

import pandas as pd
import numpy as np
import re
import dateparser
from collections import Counter
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Next, we need to load our Chat.txt into Python and read it. We will do this using the function below:

def read_file(file):
    '''Reads Whatsapp text file into a list of strings'''

  x = open(file,'r', encoding = 'utf-8') #Opens the text file into variable x but the variable cannot be explored yet
    y = x.read() #By now it becomes a huge chunk of string that we need to separate line by line
    content = y.splitlines() #The splitline method converts the chunk of string into a list of strings
    return content

Next, we use the read_file function to read our Chat.txt file into Python and save it to a variable named chat.

chat = read_file('Chat.txt')
len(chat)

We will get the following output showing us the length of lines in chat :

Some messages are not useful for our analysis and we will need to remove them. An example is the system generated message showing that someone just joined or left the group chat. Let us save all “joined” messages into a variable named join.

join = [line for line in chat if  "joined using this" in line]
join

This gives us the following output:

We will clean our chat data by removing all messages in join and also by removing all empty lines (lines that contain no message) by running the code below:

#Remove new lines
chat = [line.strip() for line in chat]
print("length of chat is:")
print(len(chat))

#Clean out the join notification lines
clean_chat = [line for line in chat if not "joined using this" in line]

#Further cleaning
#Remove empty lines
clean_chat = [line for line in clean_chat if len(line) > 1]
print("length of clean_chat is:")
print(len(clean_chat))

This gives the output:

We have 2012 lines in our clean_chat variable (This should be different for your own data).

Next, we do the same for messages that show members who left.

#Drop 'left-ers'
left = [line for line in clean_chat if line.endswith("left")]
left

For the RESAGRATIA Whatsapp Chat, no members left in the period covered by this analysis as shown below:

If your data contains messages showing that some members left, then run this code:

#Clean out the left notification lines
clean_chat = [line for line in clean_chat if not line.endswith("left")]
print(len(clean_chat))

Next, we will group all the lines in clean_chat into messages and store in a variable named msgs. Every message begins with a date e.g 12/12/19 and we will use this property in grouping. We will make use of the regex package in Python by running the code below:

#Merge messages that belong together
msgs = [] #message container
pos = 0 #counter for position of msgs in the container
"""
Flow:
For every line, see if it matches the expression which is starting with the format "number(s)+slash" eg "12/"
If it does, it is a new line of conversion as they begin with dates, add it to msgs container
Else, it is a continuation of the previous line, add it to the previous line and append to msgs, then pop previous line.
"""
for line in clean_chat:
    if re.findall("\A\d+[/]", line):
        msgs.append(line)
        pos += 1
    else:
        take = msgs[pos-1] + ". " + line
        msgs.append(take)
        msgs.pop(pos-1)
len(msgs)

This gives the output below:

We have a total of 1472 unique messages (This should be different for your data). Let’s look at the content of our msgs data:

msgs[0:10]


Looking for on-demand Excel course in data analysis and data visualisation? Check out this 3-hour crash course designed for you..

Get 90% off when you register with this Link.


EXTRACT DATE COMPONENTS AND MESSAGE CONTENT

Next, we will need to extract Date, Time, Name and Message Content from our msgs data using the codes below:

time = [msgs[i].split(',')[1].split('-')[0] for i in range(len(msgs))]
time = [s.strip(' ') for s in time] # Remove spacing
print("length of time is:")
print(len(time))
time

This gives the output:

We do the same for date, name and content.

date = [msgs[i].split(',')[0] for i in range(len(msgs))]
len(date)

name = [msgs[i].split('-')[1].split(':')[0] for i in range(len(msgs))]
len(name)

content = []
for i in range(len(msgs)):
  try:
    content.append(msgs[i].split(':')[2])
  except IndexError:
    content.append('Missing Text')
len(content)

Now we can finally use the pandas library to merge our date, time, name and content data into a Dataframe named df using the code below:

df = pd.DataFrame(list(zip(date, time, name, content)), columns = ['Date', 'Time', 'Name', 'Content'])
df

This gives the output below:

Look at the first two rows of data. The Content column shows “Missing Text.” Those are system messages and we’ll need to drop them. We can do this using:

df = df[df["Content"]!='Missing Text']
df.reset_index(inplace=True, drop=True)
df

Notice that the number of rows dropped from 1472 to 1465 (This should be different for your data).

We will create additional columns by taking advantage of built-in functions in pandas. First, let us create a Datetime column by merging Date and Time columns and using the pd.to_datetime function:

df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['DateTime']

Using our newly created Datetime column we can create a weekday column showing days of the week.

df['weekday'] = df['DateTime'].apply(lambda x: x.day_name())

We will split our Content column to create new columns showing the number of letters and words contained in each message. We will call these columns Letter_count and Word_count respectively.

df['Letter_Count'] = df['Content'].apply(lambda s : len(s))
df['Word_Count'] = df['Content'].apply(lambda s : len(s.split(' ')))
Learn Data Analysis with Excel

Next, we will split our Time column to create a new column named Hour showing the hour of the day a message was sent. For example, given 12:15, we will split the data before the colon (12) as this indicates the hour.

df['Hour'] = df['Time'].apply(lambda x : x.split(':')[0])
# The first token of a value in the Time Column contains the hour (Eg., "12" in "12:15")

Let us take a look at our Dataframe df to see the result of our preprocessing so far:

#print first five rows of our dataframe
df.head()

This gives the output:

We can now save our Dataframe df in CSV format by using pandas built-in function. This CSV will be used in creating our visuals in Power BI.

#saving to csv format
df.to_csv("WhatsappChat.csv")

Our CSV file is now saved as “WhatsappChat.csv” in our project folder. We will use this CSV file in the next part of this tutorial to create awesome visuals in Power BI.

You can check out the code file at this github repo: http://bit.ly/Obinnagithub

I got inspiration for this project here https://github.com/iamreechi

Author: Obinna Iheanachor | LinkedIn


Continue reading: Visualize WhatsApp Chats Using Python and Power BI – Part 2

To find out more about what we do, subscribe to our insights below. You can reach out to us directly for training and analytics consulting needs.

Receive Fresh, Weekly Analytics Content

Join 1K+ subscribers to receive original, freshly curated content every week.

Related article – Case Study: Analytics for a Polythene Manufacturing Company

Related article – Insights Dashboards – Nigeria’s 2020 Proposed Budget

Hands-on article – Visualizing The Top 1000 Lifetime Grossing Movies