Data Wrangling Report for WeRateDogs Twitter Archive

Introduction

This report describes the data wrangling steps performed to prepare the WeRateDogs (wrd) Twitter archive dataset for analysis.

Gathering Data

Three different datasets were used for this project:

  • a dataset containing the WeRateDogs Twitter archive, provided by Udacity for manual download
  • a dataset containing image predictions for each tweet (dog or not), downloaded programmatically using the Requests library
  • a dataset containing each tweet's JSON data (including tweet ID, creation timestamp, retweet/favorite counts, full tweet text, etc.), gathered by querying the Twitter API with Python's Tweepy library (both programmatic gathering steps are sketched after this list)
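
A minimal sketch of the two programmatic gathering steps; the predictions URL, the file names, and the API credentials are placeholders, not the project's actual values:

    import json

    import pandas as pd
    import requests
    import tweepy

    # Download the image-predictions file with Requests.
    PREDICTIONS_URL = 'https://example.com/image-predictions.tsv'  # placeholder
    response = requests.get(PREDICTIONS_URL)
    with open('image_predictions.tsv', 'wb') as f:
        f.write(response.content)

    # Authenticate against the Twitter API (credentials are placeholders).
    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Query each tweet's JSON and append it, one object per line, to a file.
    archive = pd.read_csv('twitter_archive_enhanced.csv')  # hypothetical file name
    with open('tweet_json.txt', 'w') as f:
        for tweet_id in archive.tweet_id:
            try:
                status = api.get_status(tweet_id, tweet_mode='extended')
                json.dump(status._json, f)
                f.write('\n')
            except tweepy.TweepError:
                pass  # tweet deleted or otherwise unavailable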

Assessing Data

The following issues were identified by assessing the data both visually and programmatically (a sample of the programmatic checks is sketched after the lists below):

Quality issues

  1. erroneous data types (for the timestamp, source, and retweeted_status_timestamp columns) in the wrd archive

  2. erroneous data types (for the created_at, source, and lang columns) in the tweet data

  3. some dogs have values for multiple stages

  4. some dogs have no names

  5. p1, p2, and p3 contain prediction names with inconsistent casing and separators (mixed upper- and lowercase, a mix of hyphens and underscores, and some other non-letter characters such as quotes)

  6. some rows are predicted not to contain dogs at all even though their images do contain dogs ("banana" and "china_cabinet" appeared among the predictions, to name a few)

  7. some fields in tweet data are completely empty (geo, coordinates, contributors)

  8. duplicated URLs in the expanded_urls column of the wrd archive

  9. the dog names "a" and "the" are not real names

  10. some tweets are retweets or replies rather than original ratings

  11. inconsistent denominators for ratings (not all ratings are out of 10)

Tidiness issues

  1. all three dataframes describe the same observational unit (tweet info), so they belong in a single table

  2. the wrd archive has four columns (doggo, floofer, pupper, puppo) for the single "dog stage" variable

  3. the tweet data has columns containing duplicated information (every column name ending in _str is just a string version of the column lacking the suffix)

  4. display_text_range in the tweet data packs two values (the start and end of the display text) into a single column
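
A few of the programmatic checks behind these findings; archive, images, and tweets are assumed names for the three raw dataframes, and rating_denominator is an assumed column name, neither confirmed by this report:

    # Column data types: reveals timestamps and sources stored as plain strings.
    archive.info()
    tweets.info()

    # Rows where more than one dog stage is set at once (assumes the raw
    # archive marks a missing stage with the string 'None').
    stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
    (archive[stage_cols] != 'None').sum(axis=1).value_counts()

    # Inconsistent casing and separators in the prediction names.
    images.p1.sample(10)

    # Duplicated URLs in the archive.
    archive.expanded_urls.duplicated().sum()

    # Inconsistent rating denominators.
    archive.rating_denominator.value_counts()

    # Suspicious "names" such as "a" and "the" float to the top here.
    archive.name.value_counts().head(20)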

Cleaning Data

Cleaning was performed using the "Define, Code, Test" pattern: each cleaning step is first defined to the appropriate level of detail, then implemented in code, and finally verified with additional code that tests the outcome and confirms the data was cleaned correctly. A minimal example of the pattern appears below, followed by the steps performed to address the identified quality and tidiness issues.
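
As an illustration, here is the pattern applied to one of the issues below (removing retweets); archive_clean is an assumed name for a working copy of the wrd archive:

    # Define: remove retweets, i.e. rows where retweeted_status_timestamp
    # (a column only retweets populate) is not null.

    # Code:
    archive_clean = archive_clean[archive_clean.retweeted_status_timestamp.isnull()]

    # Test: confirm that no retweets remain.
    assert archive_clean.retweeted_status_timestamp.notnull().sum() == 0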

Cleaning Quality Issues

  • erroneous data types were fixed by converting each column to an appropriate type. For the source columns, the text inside the HTML tag was extracted first and used as the category labels (this and several of the fixes below are sketched in the code after this list).
  • for rows having values for multiple dog stages, the oldest dog stage was kept and the other values were replaced with None (Python's null value).
  • missing values (and some obviously incorrect dog names) were replaced with None (Python's null value).
  • empty columns were removed.
  • predicted names were cleaned by replacing all non-letter characters with underscores.
  • rows predicted not to contain dogs were removed
    • this sacrificed some precision, because some of those rows actually contained dogs, but the analyst judged it more important to be sure the non-dog rows were gone
  • duplicated URLs were removed
  • tweets that were retweets or replies were removed
  • ratings were normalized by dividing each numerator by its denominator and then multiplying by 10, so that every rating is out of 10
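
A condensed sketch of several of these fixes, continuing with the assumed dataframe names archive_clean, images_clean, and tweets_clean; column names such as rating_numerator, rating_denominator, in_reply_to_status_id, and the p1_dog/p2_dog/p3_dog flags are assumptions, not confirmed by this report:

    import numpy as np
    import pandas as pd

    # Convert the timestamp column from string to datetime.
    archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])

    # Pull the text out of the HTML anchor tag and treat it as a category.
    archive_clean['source'] = (archive_clean['source']
                               .str.extract(r'>([^<]+)<', expand=False)
                               .astype('category'))

    # Replace placeholder dog names with a true null (pandas stores it as NaN).
    archive_clean['name'] = archive_clean['name'].replace(['a', 'the'], np.nan)

    # Drop the completely empty columns from the tweet data.
    tweets_clean = tweets_clean.drop(columns=['geo', 'coordinates', 'contributors'])

    # Make predicted names consistent: non-letter characters become underscores.
    for col in ['p1', 'p2', 'p3']:
        images_clean[col] = images_clean[col].str.replace(r'[^A-Za-z]', '_',
                                                          regex=True)

    # Keep only rows where at least one prediction says "dog".
    is_dog = images_clean.p1_dog | images_clean.p2_dog | images_clean.p3_dog
    images_clean = images_clean[is_dog]

    # Drop rows whose expanded_urls duplicate an earlier row's
    # (one reading of the dedup step).
    archive_clean = archive_clean.drop_duplicates(subset='expanded_urls')

    # Remove retweets and replies.
    archive_clean = archive_clean[archive_clean.retweeted_status_timestamp.isnull()
                                  & archive_clean.in_reply_to_status_id.isnull()]

    # Normalize every rating to a denominator of 10.
    archive_clean['rating'] = (archive_clean.rating_numerator
                               / archive_clean.rating_denominator * 10)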

Cleaning Tidiness Issues

  • the three dataframes were merged into a single dataframe using the tweet_id column as the key
  • the four dog-stage columns in the wrd archive were combined into a single column named dog_stage, whose value is the dog stage name or None (Python's null value) when no stage was recorded for that row (see the sketch after this list)
  • all columns that end with _str were removed.
  • the contents of the display_text_range column were split into display_text_start and display_text_end columns.
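
A sketch of these steps, continuing with the assumed dataframe names from above; it further assumes the raw archive records a missing stage as the string 'None', that display_text_range holds two-element lists, and that the tweet data's id column was renamed to tweet_id before merging:

    import pandas as pd

    # Merge the three dataframes on tweet_id.
    master = (archive_clean
              .merge(images_clean, on='tweet_id')
              .merge(tweets_clean, on='tweet_id'))

    # Collapse the four stage columns into one, keeping the first stage found
    # when a row has several (the report says the oldest stage was kept; the
    # ordering of stage_cols here is an assumption).
    stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

    def pick_stage(row):
        stages = [s for s in stage_cols if row[s] == s]
        return stages[0] if stages else None

    master['dog_stage'] = master[stage_cols].apply(pick_stage, axis=1)
    master = master.drop(columns=stage_cols)

    # Drop the redundant *_str duplicates.
    master = master.drop(columns=[c for c in master.columns
                                  if c.endswith('_str')])

    # Split display_text_range into separate start and end columns.
    master[['display_text_start', 'display_text_end']] = pd.DataFrame(
        master['display_text_range'].tolist(), index=master.index)
    master = master.drop(columns='display_text_range')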

Storing Data

The final cleaned dataframe was stored in a CSV file named twitter_archive_master.csv.
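
In code, a one-liner, assuming the merged dataframe is named master as in the sketches above:

    # Write the cleaned master dataframe to CSV without the pandas index.
    master.to_csv('twitter_archive_master.csv', index=False)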