Data Wrangling Report for WeRateDogs Twitter Archive

Introduction

This report describes the data wrangling steps performed to prepare the WeRateDogs (wrd) Twitter archive dataset for analysis.

Gathering Data

Three different datasets were used for this project:

  • a dataset containing the WeRateDogs Twitter archive, provided by Udacity for manual download
  • a dataset containing image predictions for each tweet (dog or not), downloaded programmatically using the Requests library
  • a dataset containing each tweet's JSON data (including tweet ID, creation timestamp, retweet/favorite counts, full tweet text, etc.), gathered by querying the Twitter API with Python's Tweepy library (both programmatic gathering steps are sketched after this list)
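
A minimal sketch of the two programmatic gathering steps; the predictions URL, the file names, and the API credentials are placeholders, not the project's actual values:

    import json

    import pandas as pd
    import requests
    import tweepy

    # Download the image-predictions file with Requests.
    PREDICTIONS_URL = 'https://example.com/image-predictions.tsv'  # placeholder
    response = requests.get(PREDICTIONS_URL)
    with open('image_predictions.tsv', 'wb') as f:
        f.write(response.content)

    # Authenticate against the Twitter API (credentials are placeholders).
    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Query each tweet's JSON and append it, one object per line, to a file.
    archive = pd.read_csv('twitter_archive_enhanced.csv')  # hypothetical file name
    with open('tweet_json.txt', 'w') as f:
        for tweet_id in archive.tweet_id:
            try:
                status = api.get_status(tweet_id, tweet_mode='extended')
                json.dump(status._json, f)
                f.write('\n')
            except tweepy.TweepError:
                pass  # tweet deleted or otherwise unavailable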

Assessing Data

The following issues were identified by assessing the data both visually and programmatically (a sample of the programmatic checks is sketched after the lists below):

Quality issues

  1. erroneous data types (for the timestamp, source, and retweeted_status_timestamp columns) in the wrd archive

  2. erroneous data types (for the created_at, source, and lang columns) in the tweet data

  3. some dogs have values for multiple stages

  4. some dogs have no names

  5. p1, p2, and p3 contain prediction names with inconsistent casing and separators (mixed upper- and lowercase, a mix of hyphens and underscores, and some other non-letter characters such as quotes)

  6. some rows are predicted not to contain dogs at all even though their images do contain dogs ("banana" and "china_cabinet" appeared among the predictions, to name a few)

  7. some fields in tweet data are completely empty (geo, coordinates, contributors)

  8. duplicated URLs in the expanded_urls column of the wrd archive

  9. the dog names "a" and "the" are not real names

  10. some tweets are retweets or replies rather than original ratings

  11. inconsistent denominators for ratings (not all ratings are out of 10)

Tidiness issues

  1. all three dataframes describe the same observational unit (tweet info), so they belong in a single table

  2. the wrd archive has four columns (doggo, floofer, pupper, puppo) for the single "dog stage" variable

  3. the tweet data has columns containing duplicated information (every column name ending in _str is just a string version of the column lacking the suffix)

  4. display_text_range in the tweet data packs two values (the start and end of the display text) into a single column
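
A few of the programmatic checks behind these findings; archive, images, and tweets are assumed names for the three raw dataframes, and rating_denominator is an assumed column name, neither confirmed by this report:

    # Column data types: reveals timestamps and sources stored as plain strings.
    archive.info()
    tweets.info()

    # Rows where more than one dog stage is set at once (assumes the raw
    # archive marks a missing stage with the string 'None').
    stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
    (archive[stage_cols] != 'None').sum(axis=1).value_counts()

    # Inconsistent casing and separators in the prediction names.
    images.p1.sample(10)

    # Duplicated URLs in the archive.
    archive.expanded_urls.duplicated().sum()

    # Inconsistent rating denominators.
    archive.rating_denominator.value_counts()

    # Suspicious "names" such as "a" and "the" float to the top here.
    archive.name.value_counts().head(20)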

Cleaning Data

Cleaning was performed using the "Define, Code, Test" pattern: each cleaning step is first defined to the appropriate level of detail, then implemented in code, and finally verified with additional code that tests the outcome and confirms the data was cleaned correctly. A minimal example of the pattern appears below, followed by the steps performed to address the identified quality and tidiness issues.
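
As an illustration, here is the pattern applied to one of the issues below (removing retweets); archive_clean is an assumed name for a working copy of the wrd archive:

    # Define: remove retweets, i.e. rows where retweeted_status_timestamp
    # (a column only retweets populate) is not null.

    # Code:
    archive_clean = archive_clean[archive_clean.retweeted_status_timestamp.isnull()]

    # Test: confirm that no retweets remain.
    assert archive_clean.retweeted_status_timestamp.notnull().sum() == 0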

Cleaning Quality Issues

  • erroneous data types were fixed by converting each column to an appropriate type. For the source columns, the text inside the HTML tag was extracted first and used as the category labels (this and several of the fixes below are sketched in the code after this list).
  • for rows having values for multiple dog stages, the oldest dog stage was kept and the other values were replaced with None (Python's null value).
  • missing values (and some obviously incorrect dog names) were replaced with None (Python's null value).
  • empty columns were removed.
  • predicted names were cleaned by replacing all non-letter characters with underscores.
  • rows predicted not to contain dogs were removed
    • this sacrificed some precision, because some of those rows actually contained dogs, but the analyst judged it more important to be sure the non-dog rows were gone
  • duplicated URLs were removed
  • tweets that were retweets or replies were removed
  • ratings were normalized by dividing each numerator by its denominator and then multiplying by 10, so that every rating is out of 10
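
A condensed sketch of several of these fixes, continuing with the assumed dataframe names archive_clean, images_clean, and tweets_clean; column names such as rating_numerator, rating_denominator, in_reply_to_status_id, and the p1_dog/p2_dog/p3_dog flags are assumptions, not confirmed by this report:

    import numpy as np
    import pandas as pd

    # Convert the timestamp column from string to datetime.
    archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])

    # Pull the text out of the HTML anchor tag and treat it as a category.
    archive_clean['source'] = (archive_clean['source']
                               .str.extract(r'>([^<]+)<', expand=False)
                               .astype('category'))

    # Replace placeholder dog names with a true null (pandas stores it as NaN).
    archive_clean['name'] = archive_clean['name'].replace(['a', 'the'], np.nan)

    # Drop the completely empty columns from the tweet data.
    tweets_clean = tweets_clean.drop(columns=['geo', 'coordinates', 'contributors'])

    # Make predicted names consistent: non-letter characters become underscores.
    for col in ['p1', 'p2', 'p3']:
        images_clean[col] = images_clean[col].str.replace(r'[^A-Za-z]', '_',
                                                          regex=True)

    # Keep only rows where at least one prediction says "dog".
    is_dog = images_clean.p1_dog | images_clean.p2_dog | images_clean.p3_dog
    images_clean = images_clean[is_dog]

    # Drop rows whose expanded_urls duplicate an earlier row's
    # (one reading of the dedup step).
    archive_clean = archive_clean.drop_duplicates(subset='expanded_urls')

    # Remove retweets and replies.
    archive_clean = archive_clean[archive_clean.retweeted_status_timestamp.isnull()
                                  & archive_clean.in_reply_to_status_id.isnull()]

    # Normalize every rating to a denominator of 10.
    archive_clean['rating'] = (archive_clean.rating_numerator
                               / archive_clean.rating_denominator * 10)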

Cleaning Tidiness Issues

  • the three dataframes were merged into a single dataframe using the tweet_id column as the key
  • the four dog-stage columns in the wrd archive were combined into a single column named dog_stage, whose value is the dog stage name or None (Python's null value) when no stage was recorded for that row (see the sketch after this list)
  • all columns that end with _str were removed.
  • the contents of the display_text_range column were split into display_text_start and display_text_end columns.
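
A sketch of these steps, continuing with the assumed dataframe names from above; it further assumes the raw archive records a missing stage as the string 'None', that display_text_range holds two-element lists, and that the tweet data's id column was renamed to tweet_id before merging:

    import pandas as pd

    # Merge the three dataframes on tweet_id.
    master = (archive_clean
              .merge(images_clean, on='tweet_id')
              .merge(tweets_clean, on='tweet_id'))

    # Collapse the four stage columns into one, keeping the first stage found
    # when a row has several (the report says the oldest stage was kept; the
    # ordering of stage_cols here is an assumption).
    stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

    def pick_stage(row):
        stages = [s for s in stage_cols if row[s] == s]
        return stages[0] if stages else None

    master['dog_stage'] = master[stage_cols].apply(pick_stage, axis=1)
    master = master.drop(columns=stage_cols)

    # Drop the redundant *_str duplicates.
    master = master.drop(columns=[c for c in master.columns
                                  if c.endswith('_str')])

    # Split display_text_range into separate start and end columns.
    master[['display_text_start', 'display_text_end']] = pd.DataFrame(
        master['display_text_range'].tolist(), index=master.index)
    master = master.drop(columns='display_text_range')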

Storing Data

The final cleaned dataframe was stored in a CSV file named twitter_archive_master.csv.
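
In code, a one-liner, assuming the merged dataframe is named master as in the sketches above:

    # Write the cleaned master dataframe to CSV without the pandas index.
    master.to_csv('twitter_archive_master.csv', index=False)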