ANALYZING HOTEL BOOKING CANCELLATIONS

Business Context

As a Data Analyst of a company owning a city hotel in a metropolitan area and a resort hotel located near a coastline both within the United States, you've been assigned by your manager to provide insights on the recent rise of hotel booking cancellations in both hotels for the past few months. These events causes the management to suffer financial losses and they would like to know some leads on reasons why customers cancel their room reservations.

Dataset

You've been given a dataset containing a list of bookings recorded in the booking system. Each row consist a booking made for a particular date and it also contains details containing the following:

Importing Python Libraries

Task 1: Import the following python libraries to be used for this task

pandas with alias pd
matplotlib.pyplot with alias plt
seaborn with alias sns

Dataset Loading

Task 2: Load the `.csv` file into a pandas dataframe and store it in a variable named `data`.

Note: For us to find the dataset stored in our drive, you may want to first mount the drive by clicking on the mount button, or by running the given code below

After mounting the drive and uploading the dataset on it, we can import the given dataset.

data = pd.read_csv('/content/drive/MyDrive/Module 5. Lesson 10. Final Task/hotel_bookings_data.csv')

data.head()

data.info()

Data Cleaning

Task 3: Remove all completely duplicated rows in the dataframe `data.`

data[data.duplicated()]

We found that there are around 32 thousand rows that are considered as complete duplicates. We have to first remove them in our data dataframe. Provide the code where the expected output is that our dataframe data no longer have duplicate values.

data = data.drop_duplicates()

print(data.shape)

assert(data.shape == (87300, 24))

Task 4: Check the Columns containing Missing Values

For us to check which columns have missing values, we have to create a pandas series named num_missing_cols that shows a list of columns along with the number of missing values indicated as NaN, and sort based on their values in descending order.

num_missing_cols = data.isna().sum().sort_values(ascending=False)

print(num_missing_cols)

Task 5: Clean the `country` and `children` column by providing a default value

data[data['country'].isna()].head()

data[data['children'].isna()].head()

We can see that there are some rows have missing values indicated as NaN.

In the country column, there might be some guests that do not have a particular country of origin, or if the guest has a particular country of countries to choose from when applying for a hotel booking.
With that kind of scenario, it would be best to keep these rows and provide a value labeled Other to the missing values in the country column.

data['country'].fillna('Other', inplace=True)

On the other hand, there are missing values in the children column. There are a variety of reasons why there are missing values in this column. In this case, you've learned from the company's previous analysis that it's most likely the guests have no children but forgot to fill the data in the column.
In this case, we can impute the missing data by filling out with 0 on the children column.

data['children'].fillna(0, inplace=True)

assert(all(data[['country','children']].isna().sum() == 0))

Task 6: Clean the date-related columns by creating a column with datetime data type.

cols_to_use = ['arrival_date_year','arrival_date_month','arrival_date_day_of_month', 'status_last_update']

data[cols_to_use].head()

data[cols_to_use].info()

We can observe that are three columns that defines the arrival date of a particular guest. On the other hand, we have status_last_update which contains all the date data in a column. Our objective here is to first convert these columns arrival date columns to a string, and concatenate them. Afterwards, convert this concatenated column along with status_last_update to a datetime data type.

First, convert the year and the day columns indicated as arrival_date_year and arrival_date_day_of_month respectively to a string column str.

data['arrival_date_year'] = data['arrival_date_year'].astype(str)data['arrival_date_day_of_month'] = data['arrival_date_day_of_month'].astype(str)

After converting them to str, we merge the three columns into a single column named arrival_date following the format YYYY-Month-DD

col_arrival_dates = ['arrival_date_year','arrival_date_month','arrival_date_day_of_month']data['arrival_date'] = data[col_arrival_dates].apply(lambda x: '-'.join(x), axis=1)

To show our results:

print(data['arrival_date'])

Next, convert the arrival_date and status_last_update column into a datetime data type using pd.to_datetime.

data['arrival_date'] = pd.to_datetime(data['arrival_date'])

data['status_last_update'] = pd.to_datetime(data['status_last_update'])

Finally, remove the columns listed in col_arrival_dates in the dataframe data.

col_arrival_dates = ['arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month']data = data.drop(col_arrival_dates, axis=1)

Task 7: Checking for Outliers

To avoid producing skewed results, we would be checking on the validity of our data inputs by removing outliers in the data. Outliers can take up different forms and meanings depending on the type of data at hand, but the common thing about them is that these are the data that are "almost impossible" to happen in the real world and these data can affect our results and summary statistics when left unnoticed.

Show the summary statistics of the column adr in the dataframe data, and store the results into a variable adr_summary.

adr_summary = data['adr'].describe()

print(adr_summary)

Show a plot describing the histogram of adr in the data. Set the y-axis scale to log and give it a title named "Histogram of Hotel Guest ADR".

Looking into our results, we can see that there's a lone data point in the 5000 range, and it wouldn't make sense to have an adr that high, so we check on the raw data by sorting the values in adr.

data['adr'].sort_values(ascending=False)

We can observe that there's one data that is very high compared to the rest. Another thing is that our lowest value is a negative number, so we remove it as well in our dataset.

Remove the rows that contains an adr of 5400 and -6.38.

data = data[~data['adr'].isin([5400, -6.38])]

data['adr'].sort_values(ascending=False)

Data Analysis and Insights

Now that we have cleaned our data, it's time to start on our Data Analysis for this task. The business objective is to learn why guests cancel their hotel room reservations.

For us to explore further, we can show the context of our data through Exploratory Data Analysis (EDA).

Task 8: Show the amount lost from cancelled bookings

To find out how much did the company loss from cancelled hotel bookings, we would look into the data by showing some summary statistics and a bar plot.

First, create a subset dataframe named amount_plotdata that excludes the rows where status is No-Show.

amount_plotdata = data[data['status'] != 'No-Show']

Next, create a seaborn barplot that uses the data amount_plotdata, x with 'status', y with 'adr', and hue with 'deposit_type'.Label the chart title Average Daily Rate of Hotel Bookings, the x label with Booking Status, y label with Average Daily Rate, and the legend title with Deposit Type.

Lastly, we would show the summary statistics using the column adr grouped by status and deposit_type.

amount_plotdata.groupby(['status','deposit_type'])['adr'].describe()

Task 9: Use `.groupby()` method to tally the cancellations and compute the cancellation rates per day.

Create three new columns named arrival_day, arrival_month, and arrival_year that uses arrival_date which gets the day, month, and year respectively.

Next, we create a new column named is_canceled that contains boolean values whether the column status has the value Canceled or not.

data['is_canceled'] = (data['status'] == 'Canceled')

We then use .groupby() method to group the given arrival columns and find the sum and count of cancelled bookings and store the results to a variable named canceled_data.

grouping_cols = ['arrival_year','arrival_month','arrival_day']

canceled_data = data.groupby(grouping_cols)['is_canceled'].agg(['sum','count']).reset_index()

We then create a new column named cancel_rate that is computed by dividing the columns sum and count from the dataframe canceled_data.

cancel_rate = column_of_sum / column_of_count

canceled_data['cancel_rate'] = canceled_data['sum'] / canceled_data['count']

canceled_data.head()

Task 10: Show a line plot of hotel booking cancellation rate

Using the created dataframe named canceled_data, create a lineplot using sns.lineplot containing the given arguments:

data: canceled_data
x: 'arrival_month'
y: 'cancel_rate'
hue: 'arrival_year'
style: 'arrival_year'
markersize: 10
palette: 'Dark2'
ci: None
markers: True
dashes: False
ax:ax

Lastly, set the title named Line Plot of Hotel Cancellation Rates Per Year, x label named Month Number, y label named Cancellation Rate, the legend title named Year.

Task 11: Use `.groupby()` method to tally the cancellations and compute the cancellation rates per deposit type and returning guest.

We then use .groupby() method to group the given columns and find the sum and count of cancelled bookings and store the results to a variable named pointplot_data.

grouping_cols = ['deposit_type','is_repeated_guest','arrival_date']

pointplot_data = data.groupby(grouping_cols)['is_canceled'].agg(['sum','count']).reset_index()

We then create a new column named cancel_rate that is computed by dividing the columns sum and count from the dataframe pointplot_data.

pointplot_data['cancel_rate'] = pointplot_data['sum'] / pointplot_data['count']

Then, we convert the column of is_repeated_guest to a data type bool.

pointplot_data['is_repeated_guest'] = pointplot_data['is_repeated_guest'].astype(bool)

pointplot_data.head()

Task 12: Show a Pointplot of Cancellation Rates by Deposit Type

Using the created dataframe named pointplot_data, create a poinplot using sns.lineplot containing the given arguments:

data: pointplot_data
x: 'deposit_type'
y: 'cancel_rate'
hue: 'is_repeated_guest'
join: False
capsize:0.1
ax:ax

Lastly, set the title named Cancellation Rates by Deposity type and Repeating Guest, x label named Deposit Type, y label named Cancellation Rate, the legend title named Returning Guest.

Conclusions and Recommendations

Overall Conclusion:

Analyzing the data on cancellation rates based on different variables provides valuable insights into guest behavior and preferences. For the Average Daily Rate of Hotel Bookings, we can observed higher average daily rate of Canceled Bookings especially for No Deposit Type. We can also observed increasing hotel cancellatation rates from Year 2016 to 2017. We can obseved that there's a higher cancellation rate for repeating guests for Non-refund deposit type. This indicates that guest characteristics and booking attributes play a crucial role in influencing booking cancellations.

Observations and Recommendations:

Deposit and Repeated Guest Strategies: The observed differences in cancellation rates based on deposit type and repeated guest status highlight the significance of giving incentives for returning guests and offering flexible deposit options. Management can consider loyalty programs, exclusive perks, and transparent refund policies to encourage guests to proceed with bookings and reduce the likelihood of cancellations. The increased cancellation rate among repeating guests for Non-refund deposit types indicates potential areas of improvement. By nurturing a stronger relationship with repeat guests, the hotel can foster loyalty and confidence in their bookings.
Year-Based Strategies: The rising cancellation rates from 2016 to 2017 warrant a closer examination of factors contributing to this trend. Management should analyze the changes in guest behavior, market dynamics, and external influences during this period to tailor strategies that address these shifts and reduce booking cancellations.
Segment-Specific Strategies: By creating tailored experiences and addressing concerns specific to each group, the hotel can encourage more secure bookings and minimize cancellations. Management can collaborate with families to design family-friendly amenities and services, fostering an environment that minimizes uncertainties and boosts family guest confidence.
Data-Driven Adaptation: Continual monitoring of cancellation trends and guest preferences is essential. The management should leverage data insights to identify emerging patterns, anticipate guest needs, and adapt strategies accordingly, creating a more guest-centric approach that reduces booking cancellations.
Guest Feedback Loop: Establishing an effective feedback mechanism can uncover guest-specific reasons for cancellations. Regular surveys and interactions can provide direct insights into guest concerns and help management make necessary adjustments to mitigate cancellations.