Refactoring Duplicate Data

Querying duplicate data

Here is our current instance model where each Movie node has a languages property:

Instance model with languages

For our latest use case:

Use case #11: What movies are available in a particular language?

This query finds all movies in Italian:

cypher
MATCH (m:Movie)
WHERE 'Italian' IN m.languages
RETURN m.title

What this query does is retrieve all Movie nodes and then test whether the languages property contains Italian. There are two issues with the data model, especially if the graph scales:

  • The name of the language is duplicated in many Movie nodes.

  • In order to perform the query, all Movie nodes must be retrieved.

A solution here is to model properties as nodes.

Refactor properties as nodes

Here are the steps we use to refactor:

  1. We take the property values for each Movie node and create a Language node.

  2. Then we create the IN_LANGUAGE relationship between that Movie node and the Language node.

  3. Finally, we remove the languages property from the Movie node.

This is the code to refactor the graph to turn the property values into nodes:

cypher
MATCH (m:Movie)
UNWIND m.languages AS language
WITH  language, collect(m) AS movies
MERGE (l:Language {name:language})
WITH l, movies
UNWIND movies AS m
WITH l,m
MERGE (m)-[:IN_LANGUAGE]->(l);
MATCH (m:Movie)
SET m.languages = null

This code iterates through all Movie nodes and creates a Language node for each language it finds and then creates the relationship between the Movie node and Language node using the IN_LANGUAGE relationship. It uses the Cypher UNWIND clause to separate each element of the languages property list into a separate row value that is processed later in the query.

This is what the instance model looks like after the refactoring:

Instance model with language node

There will only be one node with the language value of English and we remove the languages property from all Movie nodes. This eliminates a lot of duplication in the graph.

Check your understanding

Separating lists

What Cypher clause do you use to separate list elements?

  • MATCH

  • WITH

  • RETURN

  • UNWIND

Hint

This clause operates on a list to return a row value.

Solution

The clause used to separate a list into individual elements is called UNWIND.

Summary

In this lesson, you learned how to create nodes from properties containing lists of values to eliminate duplication. In the next challenge, you perform a refactoring that creates Language nodes from the languages properties.