Thankfully, the data I needed had already been scraped and shared in a Reddit post I found. Without it, I would have had to scrape the subreddit myself, which would have taken a long time since the Reddit API is rate limited. The project uses data scraped from the r/JapanTravel subreddit, covering both submissions and comments. As described in the Reddit post, the downloaded files are in .zst format: submissions.zst and comments.zst. After converting them to .csv, we can explore the data.
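These subreddit dumps are newline-delimited JSON once decompressed, so the .zst-to-.csv step boils down to reading one JSON record per line and writing it out as a CSV row. Here is a minimal sketch of that conversion using only the standard library, assuming the archive has already been decompressed (e.g. with the `zstandard` package or the `zstd` CLI); the field names in the sample are illustrative, not the exact keys in the real dump:

```python
import csv
import io
import json

def ndjson_to_csv(ndjson_text: str, fieldnames: list) -> str:
    """Convert newline-delimited JSON records to CSV, keeping only `fieldnames`."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Missing keys become empty cells rather than raising a KeyError.
        writer.writerow({k: record.get(k, "") for k in fieldnames})
    return out.getvalue()

# Tiny stand-in for a decompressed submissions dump.
sample = "\n".join([
    json.dumps({"score": 12, "created_utc": 1609459200,
                "title": "Itinerary check", "author": "alice"}),
    json.dumps({"score": 3, "created_utc": 1609545600,
                "title": "JR Pass question", "author": "bob"}),
])
print(ndjson_to_csv(sample, ["score", "created_utc", "title", "author"]))
```

For the real multi-gigabyte dumps you would stream the decompression and write rows incrementally instead of building strings in memory, but the per-record logic stays the same.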
submissions.csv
Each submission has 6 columns:
- Score
- Date
- Title
- Author
- URL
- Content
To us, the most important columns are Date, Title, and Content. But first, let’s look at the dataframe.
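Loading the CSV and checking its shape, dtypes, and the key columns is a good first pass before any heavier processing. A sketch with pandas, using a small in-memory sample in place of the real submissions.csv (swap in the actual file path; the column names follow the list above):

```python
import io
import pandas as pd

# In-memory stand-in for submissions.csv; replace with
# pd.read_csv("submissions.csv", parse_dates=["Date"]) for the real file.
csv_text = """Score,Date,Title,Author,URL,Content
42,2019-04-01,First time in Tokyo,alice,https://reddit.com/abc,Looking for ramen tips
7,2021-11-15,Kyoto in winter?,bob,https://reddit.com/def,[removed]
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.shape)    # (rows, columns)
print(df.dtypes)   # confirm Date parsed as datetime, Score as int
print(df[["Date", "Title", "Content"]].head())
```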

About the data
- Time Period: the data spans from approximately 2016 to 2022
- Quantity: 83,840 submissions and 619,514 comments
Data Challenges
- Structure Issue
- Multiple Threads: each conversation consists of a submission plus multiple threaded comments, so every submission required additional processing to reassemble its thread
- Multiple Participants: distinguishing the original poster from other community members was difficult and required further data processing
- Inconsistent Quality
- Incomplete Information: some posts lacked necessary details because they had been removed by moderators or deleted by the user before the scrape
- Mixed Value Content: some threads drifted off topic or had no discussion at all, and it was hard to tell which answers were correct
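The incomplete-information problem above is typically handled by dropping rows whose body was wiped before the scrape. A hedged sketch with pandas: Reddit replaces the body of moderator-removed posts with "[removed]" and user-deleted posts with "[deleted]", and the `Content` column name follows the table earlier in this post:

```python
import pandas as pd

# Toy frame standing in for the submissions dataframe.
df = pd.DataFrame({
    "Title": ["Itinerary check", "JR Pass question", "Onsen etiquette"],
    "Content": ["10 days in Kansai ...", "[removed]", "[deleted]"],
})

# Reddit's placeholder strings for wiped post bodies, plus empty cells.
placeholders = {"[removed]", "[deleted]", ""}
clean = df[~df["Content"].fillna("").isin(placeholders)].reset_index(drop=True)
print(len(clean))  # 1
```

This only filters the structurally unusable rows; judging off-topic or low-quality discussion still needs content-level analysis.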