How To Structure Unstructured Data

One of the most useful features of LLMs is to structure unstructured data, from Text To Code

As I wrote in one of the first editions of my newsletter, if you want to leverage AI-assisted coding, it’s crucial to learn how to talk to machines.

Extracting Values From Text

It used to be quite cumbersome to extract relevant entities from unstructured data.

Before the advent of mainstream LLMs, you had to use dedicated APIs / packages, covering a rather narrow scope.

Some were for instance able to extract city names, companies or personalities from a text string. But that was pretty much it.

Today, LLMs can figure out how to extract specific values from unstructured data based on a simple brief, which can be customized for each use case.

The AI will even be able to automatically enrich the source data when some info is missing.

The trick is to prompt the LLM to output a JSON incl. a set of keys when presented with unstructured data.

Structuring Book Reviews

Let’s take a simple example, for book reviews.

Let’s say that you write in ChatGPT:

Today is May 14, 2024. 
Yesterday I read Zero To One. It’s a good book, I would give it 8 out of 10. I loved the way the author detailed the power of monopolies. Please extract the following data and output it as a JSON dictionary: reading_date, title, author, genre, review_score, review_summary

Notice that I did not mention neither the author nor the genre.

Here’s the output of ChatGPT 4o:

{
  "reading_date": "2024-05-13",
  "title": "Zero To One",
  "author": "Peter Thiel with Blake Masters",
  "genre": "Business, Startup, Entrepreneurship",
  "review_score": 8,
  "review_summary": "I loved the way the author detailed the power of monopolies."
}

ChatGPT was able to correctly set the date to May 13, to find the co-authors (I didn’t even know Peter Thiel had partnered with Blake Masters to write the book), to pick the genre and then fill out the values for the data shared in my text string.

I can now feed the AI with another review, without repeating my brief, simply sharing the review:

Two days ago, I read the latest biography of Elon Musk. It reads like a novel, I’ll give it 9 out of 10. The book went into great details to describe the impact of Musk’s childhood on his business mindset. 

In less than 1 second, I got this JSON from ChatGPT.

 {
      "reading_date": "2024-05-12",
      "title": "Elon Musk: A Biography",
      "author": "Walter Isaacson",
      "genre": "Biography",
      "review_score": 9,
      "review_summary": "It reads like a novel. The book went into great details to describe the impact of Musk’s childhood on his business mindset."
    }

The cool thing about JSON formatting is that it enables you to properly record the data in a database.

Instant Flask Application

We can prompt ChatGPT to give us the boilerplate of a Flask app recording those dictionaries into a SQLite database.

Here is my full prompt, incl. the instruction to call the OpenAI API.

I need the full code of a Flask app, presented in one single .py file, which would:

1° prompt the user to type in their review straight in the Python shell

2° call the OpenAI API with a prompt to properly extract the structured data from the review text string, enriching the original data when needed

3° print out the JSON 
 
4° store the structured data returned in JSON format from OpenAI's API into a SQLite database (importing sqlite3)

5° confirm data storage and ask the user whether they want to input another review or quit 

Important:
Load the API key from .env
Use GPT4 as a model when interacting with OpenAI, not Davinci.

I got this reply from ChatGPT.

from flask import Flask, request, jsonify
import openai
import os
from dotenv import load_dotenv
import sqlite3
import sys

# Load environment variables
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# OpenAI API setup
client = openai.Client()

app = Flask(__name__)

# SQLite setup
conn = sqlite3.connect('reviews.db', check_same_thread=False)
c = conn.cursor()
c.execute('''
    CREATE TABLE IF NOT EXISTS reviews (
        id INTEGER PRIMARY KEY,
        reading_date TEXT,
        title TEXT,
        author TEXT,
        genre TEXT,
        review_score INTEGER,
        review_summary TEXT
    )
''')
conn.commit()

def extract_data_from_review(review_text):

    prompt = f"""
    Based on the review, output a JSON dictionary using the following keys: reading_date, title, author, genre, review_score, review_summary. 
    Review: {review_text}.
    
    """
    response =  client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    structured_data = response.choices[0].message.content

    return structured_data

def store_review_in_db(review_data):
    c.execute('''
        INSERT INTO reviews (reading_date, title, author, genre, review_score, review_summary) 
        VALUES (?, ?, ?, ?, ?, ?)
    ''', (review_data['reading_date'], review_data['title'], review_data['author'], review_data['genre'], review_data['review_score'], review_data['review_summary']))
    conn.commit()

@app.route('/submit_review', methods=['POST'])
def submit_review():
    review_text = request.json.get('review_text')
    structured_data = extract_data_from_review(review_text)
    review_data = eval(structured_data)  # Convert JSON string to dictionary
    store_review_in_db(review_data)
    return jsonify({"message": "Review stored successfully", "data": review_data})

if __name__ == '__main__':
    try:
        while True:
            review_text = input("Please type in your review: ")
            structured_data = extract_data_from_review(review_text)

            #clean the data removing backticks and json
            structured_data = structured_data.replace('```json\n', '').replace('```', '')
            review_data = eval(structured_data)

            print("Structured Data: ", review_data)
            store_review_in_db(review_data)
            print("Data stored successfully.")
            another = input("Do you want to input another review? (yes/no): ")
            if another.lower() != 'yes':
                print("Bye bye!")
                sys.exit()
    except KeyboardInterrupt:
        print("Bye bye!")
    app.run()

It was almost good.

I had to add the logic to set the reading_date vs the current date, using datetime.

I also insisted on the fact that the output should be JSON.

It’s a good practice to end your prompt with clear instructions (here JSON OUTPUT:).

from datetime import datetime
  current_date = datetime.now().strftime("%Y-%m-%d")

    prompt = f"""
    Based on the review, output a JSON dictionary using the following keys: reading_date, title, author, genre, review_score, review_summary. 
    Enrich the data based if the value can't be found in the review.
    The reading_date must be computed compared to the current date.
    Current Date: {current_date}
    Review: {review_text}.
    IMPORTANT: Please make sure the output is a valid JSON.
    JSON OUTPUT:
    """

And since I’m using GPT4, I also added some code to clean the text response from the API, removing backticks at both ends and the word JSON.

structured_data = structured_data.replace('```json\n', '').replace('```', '')

Here’s the full code which you can run in your IDE after installing the required packages.

Don’t forget to set your OPENAI_API_KEY in .env at the root of your project.

from flask import Flask, request, jsonify
import openai
import os
from dotenv import load_dotenv
import sqlite3
from datetime import datetime
import sys

# Load environment variables
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# OpenAI API setup
client = openai.Client()

app = Flask(__name__)

# SQLite setup
conn = sqlite3.connect('reviews.db', check_same_thread=False)
c = conn.cursor()
c.execute('''
    CREATE TABLE IF NOT EXISTS reviews (
        id INTEGER PRIMARY KEY,
        reading_date TEXT,
        title TEXT,
        author TEXT,
        genre TEXT,
        review_score INTEGER,
        review_summary TEXT
    )
''')
conn.commit()

def extract_data_from_review(review_text):

    current_date = datetime.now().strftime("%Y-%m-%d")

    prompt = f"""
    Based on the review, output a JSON dictionary using the following keys: reading_date, title, author, genre, review_score, review_summary. 
    Enrich the data based if the value can't be found in the review.
    The reading_date must be computed compared to the current date.
    Current Date: {current_date}
    Review: {review_text}.
    IMPORTANT: Please make sure the output is a valid JSON.
    JSON OUTPUT:
    """
    response =  client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    structured_data = response.choices[0].message.content

    return structured_data

def store_review_in_db(review_data):
    c.execute('''
        INSERT INTO reviews (reading_date, title, author, genre, review_score, review_summary) 
        VALUES (?, ?, ?, ?, ?, ?)
    ''', (review_data['reading_date'], review_data['title'], review_data['author'], review_data['genre'], review_data['review_score'], review_data['review_summary']))
    conn.commit()

@app.route('/submit_review', methods=['POST'])
def submit_review():
    review_text = request.json.get('review_text')
    structured_data = extract_data_from_review(review_text)
    review_data = eval(structured_data)  # Convert JSON string to dictionary
    store_review_in_db(review_data)
    return jsonify({"message": "Review stored successfully", "data": review_data})


if __name__ == '__main__':
    try:
        while True:
            review_text = input("Please type in your review: ")
            structured_data = extract_data_from_review(review_text)

            #clean the data removing backticks and json
            structured_data = structured_data.replace('```json\n', '').replace('```', '')
            review_data = eval(structured_data)

            print("Structured Data: ", review_data)
            store_review_in_db(review_data)
            print("Data stored successfully.")
            another = input("Do you want to input another review? (yes/no): ")
            if another.lower() != 'yes':
                print("Bye bye!")
                sys.exit()
    except KeyboardInterrupt:
        print("Bye bye!")
    app.run()

See how the LLM managed to enrich the data of my 2 inputs.

Please type in your review: Yesterday I read Zero To One. 9 out of 10. I loved the way it described the power of monopolies. 
Structured Data:  {'reading_date': '2024-05-13', 'title': 'Zero To One', 'author': 'Peter Thiel', 'genre': 'Business and Economics', 'review_score': 9, 'review_summary': 'I loved the way it described the power of monopolies.'}
Data stored successfully.
Do you want to input another review? (yes/no): yes
Please type in your review: Two days ago I finished the last book of the Harry Potter series. Epic and Magic. 10 out of 10.
Structured Data:  {'reading_date': '2024-05-12', 'title': 'Harry Potter and the Deathly Hallows', 'author': 'J.K. Rowling', 'genre': 'Fantasy', 'review_score': 10, 'review_summary': 'Epic and Magic'}
Data stored successfully.

Harry Potter and the Deathly Hallows is indeed the last book in the series.

If you want to browse the SQLite database, install DB Browser for SQLite (it’s free).

It will enable you to see the records.

Have fun coding with your favourite AI Assistant!

If you’d like to get a private introduction to the art of AI-assisted coding and more broadly a detailed overview of today’s Gen AI capabilities, I’m offering one-on-one 2-hour mentoring sessions “How To Talk To An AI Agent”.

Here’s the link for bookings.

Sessions are tailored to your specific business needs.

I can also assist you in the development of your own micro SaaS project.