This report describes the data wrangling steps performed in the preparation of the WeRateDogs (wrd) Twitter Archive dataset for analysis.
Three different datasets were used for this project: the wrd Twitter archive, the tweet image predictions, and additional tweet data.
The following issues were identified by assessing the data both visually and programmatically:
Quality issues
erroneous data types (for the timestamp, source, and retweeted_status_timestamp columns) in the wrd archive
erroneous data types (for the created_at, source, and lang columns) in the tweet data
some dogs have values for multiple stages
some dogs have no names
p1, p2, and p3 contain names with inconsistent casing and separators (mixed uppercase and lowercase, mixed hyphens and underscores, and some other non-letter characters such as quotes)
some rows are predicted not to be dogs at all even though they do contain dogs ("banana" and "china_cabinet", to name a few)
some fields in the tweet data are completely empty (geo, coordinates, contributors)
duplicated URLs in the expanded_urls column of the wrd archive
dog names "a" and "the" are not real names
some tweets are retweets
inconsistent denominators for ratings
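Checks along these lines surface several of the issues programmatically. This is only a sketch: the miniature archive dataframe below is a hypothetical stand-in for the real dataset, with just enough columns to illustrate the checks.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wrd archive dataframe.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56", "2017-07-31 00:18:03", "2017-07-30 15:58:51"],
    "name": ["Phineas", "a", "None"],
    "rating_denominator": [10, 10, 50],
})

# Erroneous data type: timestamp is stored as plain strings (object dtype).
print(archive["timestamp"].dtype)

# Suspicious dog "names": real names are capitalized, so all-lowercase
# words like "a" and "the" stand out.
bad_names = archive.loc[archive["name"].str.islower(), "name"].unique()
print(bad_names)

# Inconsistent rating denominators (the expected denominator is 10).
print(archive["rating_denominator"].value_counts())
```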
Tidiness issues
all three dataframes describe the same observational unit (tweet info)
wrd archive has 4 columns for the same "dog stage" observation (doggo, floofer, pupper, puppo)
tweet data has columns that contain duplicated info (all column names ending with _str are just string versions of the same info in the column lacking the suffix)
display_text_range in the tweet data packs two pieces of info (the start and end of the display text) into one column
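The last two tidiness issues can be confirmed with a quick check. This is a sketch; the tweets dataframe below is a hypothetical stand-in for the tweet data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tweet-data dataframe.
tweets = pd.DataFrame({
    "id": [111, 222],
    "id_str": ["111", "222"],
    "display_text_range": [[0, 85], [0, 140]],
})

# The *_str columns only duplicate their counterparts as strings.
assert tweets["id_str"].astype("int64").equals(tweets["id"])

# display_text_range packs two values (start and end) into each cell.
lengths = tweets["display_text_range"].str.len().unique()
print(lengths)  # every entry holds exactly 2 values
```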
Cleaning was performed using the "Define, Code, Test" process: each necessary cleaning step is first defined to the appropriate level of detail, then the code implementing the defined step is written, and finally more code tests the outcome and confirms that the data has been cleaned correctly. The following cleaning steps were performed to address the identified quality and tidiness issues:
The erroneous data types were converted to appropriate types; for the source columns, the contents of the HTML tags were extracted first and used as the categories.
Invalid dog names such as "a" and "the" were replaced with None (Python's null value).
Missing dog names were likewise represented as None (Python's null value).
The three dataframes were merged into one using the tweet_id column as the key.
The four dog stage columns were collapsed into a single dog_stage column, where the values are the dog stage names or None (Python's null value) if no dog stage value was available for that row.
The tweet-data columns ending with _str were removed.
The values in the display_text_range column were split into display_text_start and display_text_end columns.
The final cleaned dataframe was stored in a CSV file named twitter_archive_master.csv.
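A few of these steps can be sketched in the "Define, Code, Test" style. This is a minimal illustration, not the project's actual code: the miniature archive dataframe is a hypothetical stand-in for the merged dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wrd archive dataframe.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    ],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Define: extract the text inside the HTML anchor tags in source
# and use it as a categorical value.
# Code:
archive["source"] = archive["source"].str.extract(r">([^<]+)<")[0].astype("category")
# Test:
assert archive["source"].dtype == "category"

# Define: collapse the four dog stage columns into a single dog_stage
# column, with None where no stage is recorded.
# Code:
stages = ["doggo", "floofer", "pupper", "puppo"]
archive["dog_stage"] = archive[stages].apply(
    lambda row: next((s for s in row if s != "None"), None), axis=1
)
archive = archive.drop(columns=stages)
# Test:
assert list(archive["dog_stage"]) == [None, "doggo", "pupper"]

# Finally, store the cleaned dataframe under the report's output file name.
archive.to_csv("twitter_archive_master.csv", index=False)
```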