![Pipeline overview](pipe.png)

Assumptions

  1. We expect to receive the CST records as CSV files.

  2. New records are always newer than the existing data, i.e. they cannot land in the middle of an existing conversation.

  3. There is only one customer and one company per conversation thread.

  4. The NBA engine runs only for conversations where the customer's reply is the last message, i.e. the thread is waiting on the company (see the sketch after this list).

    Reason:

    1. Newly ingested data might already contain the customer's next reply.
    2. Some conversations may have transitioned to private DMs, and we wouldn't know if they were resolved there.
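
A minimal sketch of this gating rule. It assumes each entry inside Chat_history carries an author_id field; that nested shape is an assumption, since the stored columns (see the ER diagram below) only type it as JSONB:

```python
def awaiting_company_reply(row: dict) -> bool:
    """Return True when the conversation is waiting on the company.

    A thread with no follow-ups is just the customer's initial tweet,
    so it also qualifies for the NBA engine.
    """
    if not row["Chat_history"]:
        return True  # only the initial customer message exists
    return row["Chat_history"][-1]["author_id"] == row["customer_id"]
```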

Data Pipeline

Data ingestion takes the CST records in the form of CSV files and transforms and normalizes them into a single table (I've used JSON files for now). Each conversation starts with an initial message and may continue with follow-up replies. Incoming data is checked: if a record is already stored, we skip it; if it continues an existing conversation, we attach it to the end of that conversation. The pipeline runs regularly to catch new replies and keep the records current, and it passes only the unprocessed records to the next step, i.e. tagging.
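
A minimal sketch of that merge step against the prototype's JSON store. The CSV column names (tweet_id, in_reply_to, author_id, text) and the store filename are assumptions to adapt to the real export; per assumption 2, each batch is taken to be in chronological order:

```python
import csv
import json
from pathlib import Path

STORE = Path("interactions.json")  # JSON file standing in for the Postgres table


def ingest(csv_path: str) -> None:
    """Merge one batch of CST records into the conversation store."""
    table = json.loads(STORE.read_text()) if STORE.exists() else {}
    # Every tweet id we already hold, for the "skip if already stored" check.
    seen = {m["tweet_id"] for row in table.values() for m in row["Chat_history"]}
    seen |= set(table)  # primary tweet ids
    # Map last-message id -> conversation, so a reply attaches to the right thread.
    tails = {row["tail_id"]: pid for pid, row in table.items()}

    with open(csv_path, newline="") as f:
        for rec in csv.DictReader(f):
            tid = rec["tweet_id"]
            if tid in seen:
                continue  # already stored -> skip
            parent = rec.get("in_reply_to") or ""
            if parent in tails:
                # Continues an existing conversation: append and advance the tail.
                pid = tails.pop(parent)
                row = table[pid]
                row["Chat_history"].append(
                    {"tweet_id": tid, "author_id": rec["author_id"], "text": rec["text"]}
                )
                if rec["author_id"] != row["customer_id"]:
                    row["company_id"] = rec["author_id"]  # first company reply fixes company_id
                row["tail_id"] = tid
                row["processed"] = False  # needs tagging again
                tails[tid] = pid
            else:
                # A brand-new conversation, started by the customer.
                table[tid] = {
                    "primary_tweet_id": tid,
                    "primary_tweet": rec["text"],
                    "Chat_history": [],
                    "customer_id": rec["author_id"],
                    "company_id": None,  # known once the company replies
                    "tail_id": tid,
                    "processed": False,
                }
                tails[tid] = tid
            seen.add(tid)

    STORE.write_text(json.dumps(table, indent=2))
```

Keeping a tail_id per conversation makes attaching a reply a constant-time lookup instead of a scan over every chat history.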

```mermaid
erDiagram
    InteractionTable {
        string primary_tweet_id PK
        string primary_tweet
        JSONB Chat_history
        string customer_id
        string company_id
        string tail_id
        boolean processed
    }
```
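
In the prototype's JSON store, one record might look like the sketch below. The values are hypothetical, and the nested message shape inside Chat_history (tweet_id, author_id, text) is an assumption, since the diagram only types that column as JSONB:

```python
# One hypothetical row of InteractionTable, as kept in the JSON file.
record = {
    "primary_tweet_id": "116544",
    "primary_tweet": "@AcmeSupport my order never arrived",
    "Chat_history": [
        {"tweet_id": "116545", "author_id": "AcmeSupport",
         "text": "Sorry to hear that! Can you share your order number?"},
        {"tweet_id": "116546", "author_id": "cust_42",
         "text": "Sure, it's 8841."},
    ],
    "customer_id": "cust_42",
    "company_id": "AcmeSupport",
    "tail_id": "116546",   # id of the last message in the thread
    "processed": False,    # not yet passed to the tagging step
}
```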

Transformation

Database:

Postgres is the right way to store this data since the table has both relational fields (customer_id, primary_tweet_id) and a nested Chat_history (JSONB), and it supports indexing over both.
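
As a sketch of what that could look like once a real database is wired in; the DSN is a placeholder, and the table/column names are the ER diagram's fields lowercased:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS interaction_table (
    primary_tweet_id TEXT PRIMARY KEY,
    primary_tweet    TEXT,
    chat_history     JSONB,
    customer_id      TEXT,
    company_id       TEXT,
    tail_id          TEXT,
    processed        BOOLEAN DEFAULT FALSE
);
-- GIN index lets us query inside the nested chat history
CREATE INDEX IF NOT EXISTS idx_chat_history
    ON interaction_table USING GIN (chat_history);
-- Partial index keeps 'fetch unprocessed rows' cheap for the tagging step
CREATE INDEX IF NOT EXISTS idx_unprocessed
    ON interaction_table (primary_tweet_id) WHERE NOT processed;
"""

with psycopg2.connect("dbname=cst") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```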

I have not integrated a database into this project since it is currently a prototype. Instead, I used JSON files to store all data and intermediate results, which keeps them easy to read and inspect.

Scheduling:

We can run the data ingestion pipeline every few hours, since we need to know promptly when a customer has replied so that we can respond faster. For scheduling we can use Airflow, which provides a reliable, scalable way to run the job on a schedule while offering features like automatic retries and logging. A simple cron job would work too.
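
A minimal Airflow sketch of that schedule, assuming the ingestion step is exposed as a callable named ingest_new_records in a pipeline module (a hypothetical entry point):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline import ingest_new_records  # hypothetical module exposing the ingestion step

with DAG(
    dag_id="cst_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=4),  # "every few hours"
    catchup=False,
    default_args={
        "retries": 2,                       # automatic retries on failure
        "retry_delay": timedelta(minutes=10),
    },
) as dag:
    PythonOperator(
        task_id="ingest_csv_batch",
        python_callable=ingest_new_records,
    )
```

The cron equivalent would be a single crontab line such as `0 */4 * * * python ingest.py`, at the cost of losing Airflow's built-in retries and run history.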