This report describes the data wrangling steps performed in the preparation of the WeRateDogs (wrd) Twitter Archive dataset for analysis.
Three different datasets were used for this project: the wrd Twitter archive, the image predictions data (p1, p2, p3), and additional tweet data.
The following issues were identified by assessing the data both visually and programmatically:
Quality issues
erroneous data types (for timestamp, source, retweeted_status_timestamp columns) in wrd archive
erroneous data types (for created_at, source, lang columns) in tweet data
some dogs have values for multiple stages
some dogs have no names
p1, p2, and p3 contain names with inconsistent casing and separators (a mix of upper- and lowercase, a mix of hyphens and underscores, and some other non-letter characters such as quotes)
some images are predicted not to be dogs even though the tweets do contain dogs (predictions such as "banana" and "china_cabinet", to name a few)
some fields in tweet data are completely empty (geo, coordinates, contributors)
duplicated urls in the expanded_urls column of the wrd archive
dog names "a" and "the" are not real names
some tweets are retweets
inconsistent denominators for ratings
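One way to address the inconsistent casing and separators in the prediction columns is to lowercase the names and standardize the separators. A minimal sketch on made-up values (not the project's exact code):

```python
import pandas as pd

# Hypothetical sample mimicking the inconsistent prediction names described above
preds = pd.DataFrame({"p1": ["Labrador_retriever", "golden-retriever", "Cardigan"]})

# Lowercase everything and replace hyphens, underscores, and quotes
# with a single space separator
preds["p1_clean"] = (
    preds["p1"]
    .str.lower()
    .str.replace(r"[-_']", " ", regex=True)
    .str.strip()
)
```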
Tidiness issues
all three dataframes describe the same observational unit (tweet info)
wrd archive has 4 columns for the same "dog stage" observation (doggo, floofer, pupper, puppo)
tweet data has columns that contain duplicated info (all column names that end with _str are just string versions of the same info in the column lacking the suffix)
display_text_range in tweet data stores two pieces of information (start and end positions) in one column
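Several of these issues can be surfaced programmatically. A small sketch on a made-up stand-in for the tweet data (column names follow the issues listed above; the values are invented):

```python
import pandas as pd

# Tiny stand-in for the tweet data, with an all-empty column
# and a duplicated *_str column, as described above
tweets = pd.DataFrame({
    "id": [1, 2],
    "id_str": ["1", "2"],
    "geo": [None, None],
})

# Columns that are completely empty
empty_cols = [c for c in tweets.columns if tweets[c].isna().all()]

# Columns that duplicate info as strings
str_dupes = [c for c in tweets.columns if c.endswith("_str")]
```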
Cleaning was performed using the "Define, Code, Test" process: each cleaning step was first defined to the appropriate level of detail, then code was written to implement it, and finally further code was written to test the outcome and confirm that the data had been cleaned correctly. The following cleaning steps were performed to address the identified quality and tidiness issues:
The timestamp, retweeted_status_timestamp, and created_at columns were converted to datetime, and the source and lang columns were converted to categorical; for the source columns, the contents of the HTML tags were extracted first and used as the categories.
The invalid dog names "a" and "the" were replaced with None (python's null value).
Missing dog names were likewise set to None (python's null value).
The three dataframes were merged into a single dataframe using the tweet_id column as the key.
The four dog stage columns were combined into a single dog_stage column where the values are the dog stage names or None (python's null value) if no dog stage value was available for that row.
The tweet data columns ending with _str were removed.
The values in the display_text_range column were split into display_text_start and display_text_end columns.
The final cleaned dataframe was stored in a CSV file named twitter_archive_master.csv.
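A minimal sketch of a few of these steps in the "Define, Code, Test" style, using toy stand-ins for the real dataframes (all values here are made up for illustration):

```python
import pandas as pd

# Define: combine the four dog stage columns into one dog_stage column,
# merge the dataframes on tweet_id, and split display_text_range.

# Toy stand-ins for the real dataframes
archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "doggo": ["doggo", None],
    "floofer": [None, None],
    "pupper": [None, "pupper"],
    "puppo": [None, None],
})
tweets = pd.DataFrame({
    "tweet_id": [1, 2],
    "display_text_range": [[0, 85], [0, 140]],
})

# Code: collapse the stage columns by taking the first non-null value per row
stages = ["doggo", "floofer", "pupper", "puppo"]
archive["dog_stage"] = archive[stages].bfill(axis=1).iloc[:, 0]
archive = archive.drop(columns=stages)

# Merge on tweet_id, then split display_text_range into start/end columns
master = archive.merge(tweets, on="tweet_id")
master[["display_text_start", "display_text_end"]] = pd.DataFrame(
    master["display_text_range"].tolist(), index=master.index
)
master = master.drop(columns="display_text_range")

# Test: confirm the cleaning worked before saving
assert master["dog_stage"].tolist() == ["doggo", "pupper"]
assert "display_text_range" not in master.columns

# Store the cleaned result (commented out here to keep the sketch side-effect free)
# master.to_csv("twitter_archive_master.csv", index=False)
```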