Back to questions

Movies By Decade Uber PySpark Interview Question

Movies By Decade

Uber PySpark Interview Question

You are working as a data analyst at Netflix, and you have a DataFrame containing information about movies. The DataFrame has a column called , which stores the date each movie was released. Your task is to count how many movies were released in each decade.

At the end, return a DataFrame with two columns:

: representing the release decade (like , , , etc.)
: the number of movies released during that decade

Sort the final output by and respectively — both in ascending order.

DataFrame:

Column Name	Description	Type
title	The title of the movie	object
release_date	The release date of the movie	datetime64[ns]
genres	The genres of the movie	object
budget	The budget of the movie	int64
revenue	The revenue generated by the movie	int64
popularity	The popularity score of the movie	float64

Example Input:

title	release_date	genres	budget	revenue	popularity
Minions	2015-06-17	Family Animation Adventure Comedy	74000000	1156730962	875.581305
Interstellar	2014-11-05	Adventure Drama Science Fiction	165000000	675120017	724.247784
Deadpool	2016-02-09	Action Adventure Comedy	58000000	783112979	514.5699559999998
Guardians of the Galaxy	2014-07-30	Action Science Fiction Adventure	170000000	773328629	481.098624
Mad Max: Fury Road	2015-05-13	Action Adventure Science Fiction Thriller	150000000	378858340	434.278564
Avatar	2009-12-10	Action Adventure Fantasy Science Fiction	237000000	2787965087	150.437577
Fight Club	1999-10-15	Drama	63000000	100853753	146.75739099999996

Example Output:

decade	count
1990	1
2000	1
2010	5

Solution

Approach 1: Grouping By Release Year

When it comes to deciding which movies belong to which decade, we only care about the year they were released in. So, the first step is to extract the year from the column.

To do this, pandas provides the accessor, which lets us work with datetime objects. We will create a new column called that will store the extracted year for each movie.

Here’s how to do that:

Here is the example input after creating a new column called :

title	release_date	genres	budget	revenue	popularity	release_year
Minions	2015-06-17	Family Animation Adventure Comedy	74000000	1156730962	875.581305	2015
Interstellar	2014-11-05	Adventure Drama Science Fiction	165000000	675120017	724.247784	2014
Deadpool	2016-02-09	Action Adventure Comedy	58000000	783112979	514.5699559999998	2016
Guardians of the Galaxy	2014-07-30	Action Science Fiction Adventure	170000000	773328629	481.098624	2014
Mad Max: Fury Road	2015-05-13	Action Adventure Science Fiction Thriller	150000000	378858340	434.278564	2015
Avatar	2009-12-10	Action Adventure Fantasy Science Fiction	237000000	2787965087	150.437577	2009
Fight Club	1999-10-15	Drama	63000000	100853753	146.75739099999996	1999

Step 1: Reducing Year to Decade

Now we have the year, but we want to group the years into decades. For example:

1995 → 1990

2013 → 2010

To achieve this, we can use a math trick:

We divide the year by 10 to remove the last digit (integer division), then multiply it by 10 to get the decade. This way, years like 1991, 1992, …, 1999 will all be reduced to 1990.

We will create a new column called that will contain these decade values.

Here’s how to do that:

Here is the example input after creating a new column called :

title	release_date	genres	budget	revenue	popularity	release_year	decade
Minions	2015-06-17	Family Animation Adventure Comedy	74000000	1156730962	875.581305	2015	2010
Interstellar	2014-11-05	Adventure Drama Science Fiction	165000000	675120017	724.247784	2014	2010
Deadpool	2016-02-09	Action Adventure Comedy	58000000	783112979	514.5699559999998	2016	2010
Guardians of the Galaxy	2014-07-30	Action Science Fiction Adventure	170000000	773328629	481.098624	2014	2010
Mad Max: Fury Road	2015-05-13	Action Adventure Science Fiction Thriller	150000000	378858340	434.278564	2015	2010
Avatar	2009-12-10	Action Adventure Fantasy Science Fiction	237000000	2787965087	150.437577	2009	2000
Fight Club	1999-10-15	Drama	63000000	100853753	146.75739099999996	1999	1990

Step 2: Group and Count

Now that we have a column (where years like 1991, 1992, …, 1999 are all reduced to 1990), we can group the DataFrame by this column and count the number of movies in each decade.

We’ll use the method. The method will return a special grouped object (not a regular DataFrame). On this object, we can call the method to get the count of rows (movies) in each group. This count will be returned as a Series, and to turn it back into a DataFrame with named columns, we use and give the count column a name (like ).

If you’re not familiar with , you can learn more here.

Here’s how to do that:

Here is the example input after grouping and counting:

decade	count
1990	1
2010	5
2000	1

Step 3: Sorting

Finally, we need to sort the result first by the column and then by the column. For this step, we will use Pandas’ method.

When sorting a DataFrame by more than one column, we pass a list of column names to the argument in the order we want to sort the DataFrame.

If you’re not familiar with , you can learn more here.

Here is the full code:

Here is the example input after sorting by and respectively:

decade	count
1990	1
2000	1
2010	5

Approach 2: Using

Instead of using math to group years into decades (like in the first approach), we can use the method in pandas. This method allows us to divide numeric data into bins (intervals).

What are bins? Bins are ranges of values — for example, a bin for years 1990–2000 includes all years starting from 1990 up to (but not including) 2000.
Why use ? With , we don’t need to do manual math to calculate the decade — it automatically places each year in the correct bin based on the ranges we define.

If you are not familiar with , you can learn more here.

Step 1: Creating Bins and Decade Labels

We first find the minimum and maximum years in the data. Then, we calculate the minimum decade and maximum decade to make sure our bins cover the entire range of decades.

For example:

If the minimum year is 1975, the minimum decade is 1970.
If the maximum year is 2015, the maximum decade is 2010.

To create the list, we use , which generates numbers from the first decade to the last decade (inclusive), in steps of 10 years (a decade). For example, will create:

We use these same numbers as labels for the decades, like , , etc. To do this, we take all the bin edges except the last one (because the number of intervals is always one less than the number of bin edges). So, we use:

Step 2: Count the Number of Movies in Each Decade

Now that we have a column with numeric labels, we can count the number of movies in each decade using .

Here’s the Full Code:

Sourced from

Uber

Difficulty

Easy

Input

PySpark

Output