Thankfully, the data I needed had already been scraped and shared in a Reddit post I found. Without it, I would have had to scrape the subreddit myself, which would have taken a long time since the Reddit API is rate limited. The project uses data scraped from the r/JapanTravel subreddit, including both submissions and comments. As described in the Reddit post, the downloaded files are in .zst format: submissions.zst and comments.zst. After converting them to .csv, we can explore our data.
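Below is a minimal sketch of the kind of conversion script I mean, assuming each .zst dump is newline-delimited JSON compressed with zstd (as these Reddit dumps usually are); the file names and the fields pulled out are assumptions rather than the exact script.

```python
# A minimal sketch of the .zst -> .csv conversion, assuming each dump is
# newline-delimited JSON compressed with zstd, as these Reddit dumps
# typically are. File names and selected fields are assumptions.
import csv
import json
import zstandard as zstd

def zst_to_csv(zst_path, csv_path, fields):
    with open(zst_path, "rb") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        # The large max_window_size is needed for these long-window archives
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(src)
        buffer = ""
        while True:
            chunk = reader.read(2**24)
            if not chunk:
                break
            buffer += chunk.decode("utf-8", errors="ignore")
            *lines, buffer = buffer.split("\n")  # keep the trailing partial line
            for line in lines:
                if line.strip():
                    writer.writerow(json.loads(line))

zst_to_csv("submissions.zst", "submissions.csv",
           ["score", "created_utc", "title", "author", "url", "selftext"])
zst_to_csv("comments.zst", "comments.csv",
           ["score", "created_utc", "link_id", "author", "body"])
```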
submission.csv
There are 6 columns for each submission:
- Score
- Date
- Title
- Author
- URL
- Content
To us, the most important columns are Date, Title, and Content. But first, let's look at the dataframe.
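Something like the following gives a quick first look; the file name and the exact column spellings are assumptions based on the list above.

```python
# A quick first look with pandas; file name and column spellings are assumed.
import pandas as pd

submissions = pd.read_csv("submissions.csv")
print(submissions.shape)                                 # about 83,840 rows, 6 columns
print(submissions.columns.tolist())                      # Score, Date, Title, Author, URL, Content
print(submissions[["Date", "Title", "Content"]].head())  # the columns we care about most
```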

About the data
- Time Period: the data spans from approximately 2016 to 2022
- Quantity: 83,840 submissions and 619,514 comments
Data Challenges
- Structure Issue
- Multiple Threads: Since each conversation consisted of a submission post plus multiple threaded comments, every submission required additional processing to reassemble its thread
- Multiple Participants: It was difficult to distinguish the original poster from other community members, which required further data processing (see the sketch after this list)
- Inconsistent Quality
- Incomplete Information: Some posts lacked necessary details because they had been removed or deleted by the user by the time they were scraped
- Mixed Value Content: Some threads contained off-topic discussions or no discussion at all, and it was hard to tell which answers were correct
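To make the structure issues concrete, here is a hedged sketch of the kind of joining and flagging that was needed. The comments file is assumed to carry Reddit's usual link_id, author, and body fields, and the submission id is assumed to be recoverable from the URL column; both are assumptions about this particular scrape, not a description of the exact pipeline.

```python
# A sketch of the extra processing the structure issues call for: attach each
# comment to its parent submission and flag comments written by the original
# poster. Field names (link_id, author, body) and the URL format are assumptions.
import pandas as pd

submissions = pd.read_csv("submissions.csv")
comments = pd.read_csv("comments.csv")

# Submission URLs look like .../r/JapanTravel/comments/<id>/<slug>/
submissions["submission_id"] = submissions["URL"].str.extract(
    r"/comments/(\w+)", expand=False)

# Comment link_ids look like "t3_<id>"; strip the "t3_" prefix to match
comments["submission_id"] = comments["link_id"].str[3:]

threads = comments.merge(
    submissions[["submission_id", "Author", "Title"]].rename(
        columns={"Author": "op_author", "Title": "submission_title"}),
    on="submission_id", how="inner")

# A comment comes from the original poster when its author matches the submission's
threads["is_op"] = threads["author"] == threads["op_author"]

# Drop content Reddit replaced with placeholders after removal or deletion
threads = threads[~threads["body"].isin(["[removed]", "[deleted]"])]
```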