Using EDA.ipynb
, I conducted EDA on the dataset to better understand the data and its distribution. This lead me to do the following:
flowchart LR
title["Data Collection and Preprocessing"]
style title fill:none,stroke:none,color:black,font-size:16px
title === A
A[Dowload JapanTravel.zst] --> B[Convert to Dataframe] --> C[Extract submissions and comments] --> D[Filter and clean data] --> E[Convert conversation to JSON] -->F[Save as CSV]
%% Node styling
classDef boxStyle fill:#ffcccb,stroke:#ff6b6b,stroke-width:1px,color:#333333
classDef titleClass fill:none,stroke:none
A:::boxStyle
B:::boxStyle
C:::boxStyle
D:::boxStyle
E:::boxStyle
F:::boxStyle
title:::titleClass
flowchart LR
title["Preparing the Training Data"]
style title fill:none,stroke:none,color:black,font-size:16px
title === A
A[Raw CSV] --> B[Parse Coversations] --> C[Prepare Prompt] --> D[Send to local LLM]
E[Summarize comment thread and generate advice]--> F[Classify Category] --> G[Save as Training data CSV]
%% Node styling
classDef boxStyle fill:#ffcccb,stroke:#ff6b6b,stroke-width:1px,color:#333333
classDef inProgress fill:#ffd7c1,stroke:#ff9f76,stroke-width:2px,color:#333333
classDef titleClass fill:none,stroke:none
A:::boxStyle
B:::boxStyle
C:::boxStyle
D:::boxStyle
E:::inProgress
F:::inProgress
G:::inProgress
title:::titleClass
Currently, the project is at this step. We
Input Raw CSV : this is the csv that we saved at the end of our EDA. It contains
Input Processing: Using conversation_summary.py
we extract the necessary columns and append it into a prompt which is used to feed into the local LLM
LLM processing: Using LMStudio, we can pass the prompt and ask the LLM to complete it. Here are the settings I’m using and the model running is granite-3.1-8b
This is the system prompt I’m using:
<aside> 💡
You are a Japan travel expert providing personalized travel advice.
CRITICAL: You are the SOLE advisor. Never mention Reddit, discussions, forums, or other people's opinions. Present all advice as YOUR expert knowledge.
TASK: Analyze the community input and provide helpful travel advice.
IF NO USEFUL ANSWER EXISTS: Category: Questions Response: I'm not sure about that, or similar uncertainty phrases.
OUTPUT FORMAT (follow exactly): Category: [Categorize as ONE of: Trip Reports, Itineraries, Recommendations, Questions, Advice] Response: [2-4 sentences of specific, actionable Japan travel advice. Write in first person as YOUR recommendations.]
RESPONSE QUALITY RULES:
NEVER MENTION:
EXAMPLE: Category: Questions Response: I recommend booking your car rental in advance through Tocoo, which offers significant discounts especially from New Chitose Airport. You should reserve your preferred vehicle class early since larger cars get fully booked during peak seasons, and advance booking typically guarantees better rates.
OUTPUT: One category + one response only. Stop immediately after.
</aside>
After messing around with the prompt for a awhile, I found that you have to explicitly tell it to present it as its own POV or it will start to mention other redditors or reddit which we don’t want to in the final product.
Still waiting for the training data :(
flowchart LR
title["# Model Training and Deployment"]
style title fill:none,stroke:none,color:black,font-size:16px
title === A
A[Summarized Data CSV] --> B[Preprocessing] --> C[MLX Fine-tuning] --> D[Model Evaluation] --> E[Deployment] --> F[User Interface]
%% Node styling
classDef boxStyle fill:#ffcccb,stroke:#ff6b6b,stroke-width:1px,color:#333333
classDef inProgress fill:#ffd7c1,stroke:#ff9f76,stroke-width:2px,color:#333333
classDef pending fill:#ffeaea,stroke:#ffb3b3,stroke-width:1px,color:#777777
classDef titleClass fill:none,stroke:none
A:::inProgress
B:::pending
C:::pending
D:::pending
E:::pending
F:::pending
title:::titleClass