Averaging Values

This video was recorded using an earlier version of the sandbox (Neo4j 4.x). You are using a Neo4j 5.4 sandbox for this course. The code example for durations has changed to add a WITH clause for sum() and avg().

Using avg() to average values

Cypher avg() returns the average of all values retrieved during aggregation. The values must be numeric types or durations. The result returned is a float.

Here is an example:

cypher
PROFILE MATCH (m:Movie)
WHERE m.year = 2000
RETURN avg(m.revenue) as AvgRevenue

Notice that for this Cypher code, the Movie node data is aggregated to create the average of the revenue property values.

If the values used for avg() contain nulls, then that element is not used in the calculation.

Here is an example:

cypher
WITH [1,3, null] as Nums
UNWIND Nums AS x
RETURN avg(x)

In this query, only 2 elements are used to sum the values and then divide by 2 to create the average returned.

Here is another example where we use a subquery to filter the data:

cypher
CALL {
  MATCH (m:Movie) WHERE m.year = 2000
  RETURN m ORDER BY m.imdbRating DESC LIMIT 10
}
MATCH  (:User)-[r:RATED]->(m)
RETURN m.title as Movie, avg(r.rating) AS AverageRating

The subquery is demarcated by the {}s here. The subquery returns 10 Movie nodes of movies released in 2000 with the highest imdbRating. Then with these 10 movies, it determines the average rating for each of these 10 movies.

avg() and eager aggregation

A best practice in a query is to aggregate early. Here is an example where we do so:

cypher
PROFILE MATCH (m:Movie)
WITH avg(m.imdbVotes) as AverageVote
MATCH (m2:Movie)
WHERE m2.year = 2010 AND m2.imdbVotes > AverageVote
RETURN AverageVote AS OverallAverageVote, m2.title AS Title , m2.imdbVotes AS AverageVote

The call to avg() requires that all movie nodes are retrieved before the next step in the execution plan.

Here is another example:

cypher
MATCH (u:User {name: "Misty Williams"})-[r:RATED]->(:Movie)
WITH u, avg(r.rating) AS average
MATCH (u)-[r:RATED]->(m:Movie)
WHERE r.rating > average
RETURN average , m.title AS movie,
r.rating as rating
ORDER BY rating DESC

For this query, we first calculate the average rating for all movies that Misty Williams rated. Then we use this calculated value, Average as a test for the second MATCH.

Using avg() for durations

In addition to aggregating property values for nodes retrieved as an average, you can also aggregate durations and average them. A Neo4j duration type is used to calculate the amount of time between two dates or times.

Here is an example, modified from the query in the previous lesson.

cypher
MATCH (a:Actor)
WHERE
a.born IS NOT NULL
AND a.name STARTS WITH 'Tom'
WITH count(a) AS NumActors, collect(duration.between(date(a.born), date())) AS Ages
UNWIND Ages AS x
WITH  NumActors, sum(x) AS TotalAges, avg(x) AS AverageAge
RETURN  TotalAges, TotalAges/NumActors, AverageAge

In this query, we retrieve all Actor nodes with a name that begins with "Tom". We collect the durations between the birth date and now. We then UNWIND the list of durations to aggregate the values using avg(). We return the sum, calculated average, and result of avg().

Calculating standard deviation

Used frequently with averages is the standard deviation value. There are two flavors of standard deviation. stddev() uses a sample of the values for an unbiased estimate. stddevP() uses the exact population of values so for accuracy, this function should be used. However, stddev() will perform better.

Here is an example:

cypher
PROFILE MATCH (m:Movie)
RETURN avg(m.imdbRating), stdev(m.imdbRating), stdevP(m.imdbRating)

Check your understanding

1. Elements used for avg()

Given this list:

[0,null,1,null,3,null,6,null,0,null]

What is the result of calling avg() on its elements?

  • ❏ 1.0

  • ✓ 2.0

Hint

Null elements are not used in the calculation.

Solution

The correct answer is 2.0.

2. Average of ratings

We want to return the average of the User ratings for all movies released in 2000. How do you return this data?

Use the dropdown below to select the correct code to complete this query.

Once you have selected your option, click the Check Results query button to continue.

cypher
MATCH (m:Movie)-[r:RATED]-(:User)
WHERE m.year = 2000
/*select:RETURN avg(r.rating) AS AverageRatings*/
  • RETURN mean(r.rating) AS AverageRatings

  • RETURN avg(r.rating) AS AverageRatings

  • RETURN collect(r.rating)/count(r) AS AverageRatings

  • RETURN collect(r.rating)++/count(r) AS AverageRatings

Hint

This function totals and calculates the average numeric or duration values.

Solution

The correct answer is RETURN avg(r.rating) AS AverageRatings

Summary

In this lesson, you reviewed and learned how aggregation is used to average data in your Cypher queries. In the next Challenge, you will create a query that averages data.