Fine-tuning the Cypher Generation

With any dataset, there are peculiarities that you need to account for.

Have you tried asking the bot What year was The Matrix released? It may respond that it cannot provide an answer. This is due to a peculiarity in the data.

The Problem

If you look at the database, movies with a title starting with The have that word moved to the end of the title property. For example, The 39 Steps is stored as 39 Steps, The.

The database also contains embeddings for the movie titles. The Cypher the LLM generates will sometimes return these embeddings, resulting in a token limit error when the LLM tries to process the returned data.
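You can confirm both of these behaviors with a couple of quick queries. The sketch below is illustrative only and assumes graph is the Neo4jGraph instance created earlier in the course; your property names may differ.

```python
# Illustrative check only - assumes `graph` is the Neo4jGraph instance
# created earlier in the course.

# Titles beginning with "The" are stored with ", The" moved to the end.
titles = graph.query(
    'MATCH (m:Movie) WHERE m.title STARTS WITH "Matrix" RETURN m.title LIMIT 5'
)
print(titles)

# Listing a node's property keys reveals the embedding properties that
# should not be returned to the LLM.
keys = graph.query("MATCH (m:Movie) RETURN keys(m) AS properties LIMIT 1")
print(keys)
```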

To fix these or any other data-related discrepancies, you can fine-tune the Cypher generation process by modifying the prompt and instructions used by the LLM when generating the Cypher statement.

The Solution

You can pass a cypher_prompt when creating the Cypher QA chain. The prompt must include two placeholders, {schema} and {question}, which will be populated by the chain.

In tools/cypher.py, create a cypher generation prompt that addresses these issues:

```python
from langchain.prompts.prompt import PromptTemplate

CYPHER_GENERATION_TEMPLATE = """
You are an expert Neo4j Developer translating user questions into Cypher to answer questions about movies and provide recommendations.
Convert the user's question based on the schema.

Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.

Do not return entire nodes or embedding properties.

Fine Tuning:

For movie titles that begin with "The", move "the" to the end. For example "The 39 Steps" becomes "39 Steps, The" or "the matrix" becomes "Matrix, The".

Schema:
{schema}

Question:
{question}

Cypher Query:
"""

cypher_prompt = PromptTemplate.from_template(CYPHER_GENERATION_TEMPLATE)
```

Review the prompt and note how specific instructions are provided to deal with:

  1. The The prefix in movie titles.

  2. Not returning embeddings or entire nodes (which could contain embeddings).
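If you want to check that the template renders correctly before wiring it into the chain, you can fill the placeholders manually. This is just a quick sanity check; the schema and question values below are placeholder strings, not real data.

```python
# Render the template with dummy values to confirm both placeholders work.
preview = cypher_prompt.format(
    schema="(:Movie)<-[:DIRECTED]-(:Director)",
    question="What year was The Matrix released?",
)
print(preview)
```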

Add the cypher_prompt when creating the cypher_qa chain:

```python
# Append the cypher_prompt argument
cypher_qa = GraphCypherQAChain.from_llm(
    llm,
    graph=graph,
    verbose=True,
    cypher_prompt=cypher_prompt
)
```
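Before testing through the bot, you can optionally call the chain on its own to confirm the new prompt is being used. A minimal sketch, using the chain's default query and result keys:

```python
# Run the chain directly - with verbose=True the generated Cypher is printed.
response = cypher_qa.invoke({"query": "What year was The Matrix released?"})
print(response["result"])
```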

Testing the Prompt

Ask the bot a question about The Matrix, for example, Who directed The Matrix? You should see that the generated Cypher statement refers to the movie title with , The appended to the string.

Console output
> Entering new AgentExecutor chain...
```json
{
    "action": "Cypher QA",
    "action_input": "who directed the matrix?"
}
```
> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (m:Movie)<-[:DIRECTED]-(d:Director)
WHERE m.title = "Matrix, The"
RETURN d.name AS director
Full Context:
[{'director': 'Lana Wachowski'}, {'director': ' Lilly Wachowski'}]
> Finished chain.
Observation: {'query': 'who directed the matrix?', 'result': 'The Matrix was directed by Lana Wachowski and Lilly Wachowski.'}
Thought:```json
{
    "action": "Final Answer",
    "action_input": "The Matrix was directed by Lana Wachowski and Lilly Wachowski."
}
```
> Finished chain.
> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "The Matrix was directed by Lana Wachowski and Lilly Wachowski."
}
```
> Finished chain.

However, the generated Cypher may still produce errors or return the wrong results. You can further improve the responses by using few-shot prompting, which you will learn about in the next lesson.


Summary

In this lesson, you used fine-tuning to fix problems with Cypher statement generation.

In the next lesson, you will learn how few-shot prompting can help improve the quality of the Cypher statements generated by the LLM.