The Pipeline | Notion

Data Collection and Preprocessing

Using EDA.ipynb , I conducted EDA on the dataset to better understand the data and its distribution. This lead me to do the following:

Content Distribution: WIP (waiting on training data processing)
- After the training data gets labeled and classified, I will try to ensure that there is a balanced representation across categories. If not, I will prompt another LLM to generate synthetic data based on the processed ones as to not misrepresent any category.
Text Length: I saw that the submission length was usually shorter than the length of the comment threads. This prompted me to increase the context length in order to accommodate for such long sequences
Quality of Posts: The EDA showed that some of the threads contained minimal or unhelpful responses, so I decided to add a "NO_USEFUL_ANSWER" to posts or responses that the LLM did not know how to classify.

flowchart LR
    title["Data Collection and Preprocessing"]
    style title fill:none,stroke:none,color:black,font-size:16px
    
    title === A
    
    A[Dowload JapanTravel.zst] --> B[Convert to Dataframe] --> C[Extract submissions and comments] --> D[Filter and clean data] --> E[Convert conversation to JSON] -->F[Save as CSV]
    
    %% Node styling
    classDef boxStyle fill:#ffcccb,stroke:#ff6b6b,stroke-width:1px,color:#333333
    classDef titleClass fill:none,stroke:none
    
    A:::boxStyle
    B:::boxStyle
    C:::boxStyle
    D:::boxStyle
    E:::boxStyle
    F:::boxStyle
    title:::titleClass

Preparing the Training Data

flowchart LR
    title["Preparing the Training Data"]
    style title fill:none,stroke:none,color:black,font-size:16px
    
    title === A
    
    A[Raw CSV] --> B[Parse Coversations] --> C[Prepare Prompt] --> D[Send to local LLM]
    E[Summarize comment thread and generate advice]--> F[Classify Category] --> G[Save as Training data CSV]
    
    %% Node styling
    classDef boxStyle fill:#ffcccb,stroke:#ff6b6b,stroke-width:1px,color:#333333
    classDef inProgress fill:#ffd7c1,stroke:#ff9f76,stroke-width:2px,color:#333333
    classDef titleClass fill:none,stroke:none
    
    A:::boxStyle
    B:::boxStyle
    C:::boxStyle
    D:::boxStyle
    E:::inProgress
    F:::inProgress
    G:::inProgress

    title:::titleClass

Currently, the project is at this step. We

Input Raw CSV : this is the csv that we saved at the end of our EDA. It contains
- Post_ID
- Title
- Submission_Content
- Conversation
Input Processing: Using conversation_summary.py we extract the necessary columns and append it into a prompt which is used to feed into the local LLM
LLM processing: Using LMStudio, we can pass the prompt and ask the LLM to complete it. Here are the settings I’m using and the model running is granite-3.1-8b

This is the system prompt I’m using:

<aside> 💡

You are a Japan travel expert providing personalized travel advice.

CRITICAL: You are the SOLE advisor. Never mention Reddit, discussions, forums, or other people's opinions. Present all advice as YOUR expert knowledge.

TASK: Analyze the community input and provide helpful travel advice.

IF NO USEFUL ANSWER EXISTS: Category: Questions Response: I'm not sure about that, or similar uncertainty phrases.

OUTPUT FORMAT (follow exactly): Category: [Categorize as ONE of: Trip Reports, Itineraries, Recommendations, Questions, Advice] Response: [2-4 sentences of specific, actionable Japan travel advice. Write in first person as YOUR recommendations.]

RESPONSE QUALITY RULES:
- Include specific details: locations, prices, timeframes, services
- Give actionable steps travelers can take
- Use natural, helpful tone as a personal advisor
- Present multiple solutions but focus on the most popular/reasonable
- Write as if this is YOUR personal expertise
NEVER MENTION:
- Reddit, redditors, discussions, forums, communities
- "Based on the conversation" or "others suggest"
- "People say" or "according to users"
- Any reference to this being from a discussion
EXAMPLE: Category: Questions Response: I recommend booking your car rental in advance through Tocoo, which offers significant discounts especially from New Chitose Airport. You should reserve your preferred vehicle class early since larger cars get fully booked during peak seasons, and advance booking typically guarantees better rates.

OUTPUT: One category + one response only. Stop immediately after.

</aside>

After messing around with the prompt for a awhile, I found that you have to explicitly tell it to present it as its own POV or it will start to mention other redditors or reddit which we don’t want to in the final product.

Model Training and development

Still waiting for the training data :(

flowchart LR
    title["# Model Training and Deployment"]
    style title fill:none,stroke:none,color:black,font-size:16px
    
    title === A
    
    A[Summarized Data CSV] --> B[Preprocessing] --> C[MLX Fine-tuning] --> D[Model Evaluation] --> E[Deployment] --> F[User Interface]
    
    %% Node styling
    classDef boxStyle fill:#ffcccb,stroke:#ff6b6b,stroke-width:1px,color:#333333
    classDef inProgress fill:#ffd7c1,stroke:#ff9f76,stroke-width:2px,color:#333333
    classDef pending fill:#ffeaea,stroke:#ffb3b3,stroke-width:1px,color:#777777
    classDef titleClass fill:none,stroke:none
    
    A:::inProgress
    B:::pending
    C:::pending
    D:::pending
    E:::pending
    F:::pending
    title:::titleClass

sample_training