import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
Do you still head to Geisel when it's freezing cold at 8 AM? We tried to find out whether there is a relationship between weather conditions and the occupancy rate of UCSD libraries. Using a dataset covering every gate counter in the Geisel and Wong Avery libraries, we obtained an authoritative representation of occupancy at an hourly granularity.
After some exploratory data analysis, we realized that not only do people visit the libraries less on weekends, but Fridays also see lower occupancy throughout the year. Surprisingly, the overall occupancy did not fluctuate much over the course of a quarter, and the libraries are not especially crowded during finals week.
Finally, we combined Geisel's data with a weather dataset from NOAA. The results show a moderate relationship between temperature and the in/out traffic of Geisel Library.
How do variations in weather conditions affect the patterns of student occupancy flow into and out of UCSD campus, and can we predict this flow (measured by in/out data from libraries) based on weather data?
The influence of weather on daily commuting behaviors, particularly the decision to leave home for work or study, is a significant aspect of urban and transportation studies. Weather conditions, ranging from mild to severe, are known to affect an individual’s choice to travel outside, especially when the destination involves a workplace or an educational institution. This is exemplified by studies that have shown a decrease in outdoor activities, such as exercise and recreation, during adverse weather conditions [1].
Additionally, research using mobile phone GPS data indicates that weather parameters like temperature and rainfall influence daily activity patterns, affecting decisions regarding destinations and time spent outside [2]. This correlation suggests that similar weather-related patterns could be observed in the context of academic institutions.
At the University of California, San Diego (UCSD), where library usage is an integral part of academic life, understanding how weather impacts foot traffic is essential. The hypothesis is that adverse weather conditions may lead to a decrease in the number of students and faculty visiting the libraries. However, this effect might be counteracted during periods of high academic demand, such as midterms and finals week, when library resources are in greater demand regardless of weather conditions. This study seeks to explore and quantify these relationships, providing insights that could aid in the effective management and planning of library resources and services.
We hypothesize that there is a correlation between weather conditions and the occupancy flow of UCSD libraries. Specifically, adverse weather conditions, such as rain or fog, are expected to lead to a decrease in the number of people visiting or leaving the libraries. We also expect that during academic events such as midterm and finals weeks, library traffic will increase regardless of weather conditions.
We got this dataset from UCSD librarian Deborah Kegel, who generously provided the gate count information for the Geisel and Wong Avery libraries. It records the number of people coming in and out through the library security gates at an hourly granularity. Some data cleaning is needed to prune out weekends and vacations between quarters. We also grouped the gate counts to generate the total in/out for each timestamp.
After cleaning, we kept only three features: [In, Out, In_Out_Diff]
This dataset is from NOAA, the most authoritative source for weather data. We downloaded data from the La Jolla station, as it is the closest to UCSD. There are a lot of features in this dataset, so we dropped many that we do not need. Many conversions are needed because the data is logged in an unusual format, see here. We also transformed the wind degree into the four major directions for simplicity.
After cleaning, we kept only five features: [Dry_Bulb_Temp, Sea_Surface_Temp, Wind_Degree, Wind_Speed, Wind_Direction]
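To give a feel for that unusual format, here is a minimal sketch (our own illustration, using the TMP value from the first row of the raw table shown later) of how one encoded field decodes: the reading and a quality flag are packed into a single comma-separated string, and the reading is scaled by a factor of 10.

# Illustrative decode of one raw NOAA field; "+0135,1" is the example value from the raw table below
raw_tmp = "+0135,1"                 # dry-bulb temperature field: <scaled value>,<quality code>
value, quality = raw_tmp.split(",")
print(float(value) / 10)            # -> 13.5 (degrees Celsius)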
We transformed the time column of both datasets to datetime format; to combine them, we simply call the pandas merge function on the timestamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
lib_df = pd.read_csv("Gate Counters.csv")
lib_df['datetime'] = pd.to_datetime(lib_df['Date'] + ' ' + lib_df['Time'])
lib_df.drop(columns=['Date', 'Time'], inplace=True)
lib_df = lib_df[['datetime','Branch','Group','Security Gate','In','Out']]
lib_df['In_Out_Diff'] = lib_df['In'] - lib_df['Out']
Raw combined library dataset's format:
lib_df.head()
 | datetime | Branch | Group | Security Gate | In | Out | In_Out_Diff |
---|---|---|---|---|---|---|---|
0 | 2022-09-15 05:00:00 | Geisel | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 1 | 0 | 1 |
1 | 2022-09-15 05:00:00 | Geisel | Geisel - East Entrance | FEIG Gate 2nd Counter on Bus 1 | 2 | 0 | 2 |
2 | 2022-09-15 05:00:00 | Geisel | Geisel - West Entrance | FEIG Gate 1st Counter on Bus 1 | 0 | 0 | 0 |
3 | 2022-09-15 05:00:00 | Geisel | Geisel - West Entrance | FEIG Gate 2nd Counter on Bus 1 | 0 | 1 | -1 |
4 | 2022-09-15 06:00:00 | Geisel | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 13 | 14 | -1 |
Geisel_df = lib_df[lib_df['Branch'] == 'Geisel'].drop(columns=['Branch'])
WongAvery_df = lib_df[lib_df['Branch'] == 'WongAvery'].drop(columns=['Branch'])
Raw Geisel dataset’s format:
Geisel_df.head()
 | datetime | Group | Security Gate | In | Out | In_Out_Diff |
---|---|---|---|---|---|---|
0 | 2022-09-15 05:00:00 | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 1 | 0 | 1 |
1 | 2022-09-15 05:00:00 | Geisel - East Entrance | FEIG Gate 2nd Counter on Bus 1 | 2 | 0 | 2 |
2 | 2022-09-15 05:00:00 | Geisel - West Entrance | FEIG Gate 1st Counter on Bus 1 | 0 | 0 | 0 |
3 | 2022-09-15 05:00:00 | Geisel - West Entrance | FEIG Gate 2nd Counter on Bus 1 | 0 | 1 | -1 |
4 | 2022-09-15 06:00:00 | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 13 | 14 | -1 |
Summing over all gate counters to obtain the total in/out values per timestamp.
Geisel_Merged_df = Geisel_df.groupby(['datetime', 'Security Gate']).agg({'In': 'sum', 'Out': 'sum'}).reset_index()
Geisel_Merged_df = Geisel_Merged_df.groupby('datetime').agg({'In': 'sum', 'Out': 'sum'}).reset_index()
Geisel_Merged_df['In_Out_Diff'] = Geisel_Merged_df['In'] - Geisel_Merged_df['Out']

WongAvery_Merged_df = WongAvery_df.groupby(['datetime', 'Security Gate']).agg({'In': 'sum', 'Out': 'sum'}).reset_index()
WongAvery_Merged_df = WongAvery_Merged_df.groupby('datetime').agg({'In': 'sum', 'Out': 'sum'}).reset_index()
WongAvery_Merged_df['In_Out_Diff'] = WongAvery_Merged_df['In'] - WongAvery_Merged_df['Out']
Since there is almost no one in the library during breaks, we neglect those days. The days we keep fall within the 2022-2023 academic year:
Fall: 2022-09-19 to 2022-12-10
Winter: 2023-01-04 to 2023-03-15
Spring: 2023-03-29 to 2023-06-16
def keepQuarterDay(dataFrame: pd.DataFrame) -> pd.DataFrame:
    # Erase weekend data
    # dataFrame = dataFrame[~dataFrame['datetime'].dt.weekday.isin([5, 6])]
    # Quarter days filter
    fall_range = (dataFrame['datetime'] >= '2022-09-19') & (dataFrame['datetime'] <= '2022-12-10')
    winter_range = (dataFrame['datetime'] >= '2023-01-04') & (dataFrame['datetime'] <= '2023-03-15')
    spring_range = (dataFrame['datetime'] >= '2023-03-29') & (dataFrame['datetime'] <= '2023-06-16')

    return dataFrame[fall_range | winter_range | spring_range]

Geisel_Merged_df = keepQuarterDay(Geisel_Merged_df)
WongAvery_Merged_df = keepQuarterDay(WongAvery_Merged_df)
Geisel_df = keepQuarterDay(Geisel_df)
WongAvery_df = keepQuarterDay(WongAvery_df)
Geisel dataset after cleaning:
Geisel_Merged_df.head()
 | datetime | In | Out | In_Out_Diff |
---|---|---|---|---|
30 | 2022-09-19 06:00:00 | 2 | 3 | -1 |
31 | 2022-09-19 07:00:00 | 42 | 8 | 34 |
32 | 2022-09-19 08:00:00 | 110 | 22 | 88 |
33 | 2022-09-19 09:00:00 | 135 | 60 | 75 |
34 | 2022-09-19 10:00:00 | 180 | 116 | 64 |
Documentation of the dataset: Link.
After merging 2022 and 2023 data, here’s the raw form of the weather dataset:
weather_2022 = pd.read_csv("La Jolla 2022.csv")
weather_2023 = pd.read_csv("La Jolla 2023.csv")
weather_df = pd.concat([weather_2022, weather_2023], ignore_index=True)
weather_df.head()
 | STATION | DATE | SOURCE | LATITUDE | LONGITUDE | ELEVATION | NAME | REPORT_TYPE | CALL_SIGN | QUALITY_CONTROL | WND | CIG | VIS | TMP | DEW | SLP | MD1 | OD1 | SA1 | REM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 99401899999 | 2022-01-01T00:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 290,1,N,0082,1 | 99999,9,9,N | 999999,9,9,9 | +0135,1 | +9999,9 | 10073,1 | 3,1,003,1,+999,9 | 5,99,0098,1,999 | +154,1 | SYN080LJAC1 46/// /2916 10135 40073 53003 9000... |
1 | 99401899999 | 2022-01-01T01:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 270,1,N,0062,1 | 99999,9,9,N | 999999,9,9,9 | +0134,1 | +9999,9 | 10074,1 | 2,1,008,1,+999,9 | 5,99,0082,1,999 | +153,1 | SYN080LJAC1 46/// /2712 10134 40074 52008 9010... |
2 | 99401899999 | 2022-01-01T02:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 280,1,N,0057,1 | 99999,9,9,N | 999999,9,9,9 | +0132,1 | +9999,9 | 10081,1 | 3,1,013,1,+999,9 | 5,99,0072,1,999 | +152,1 | SYN080LJAC1 46/// /2811 10132 40081 53013 9020... |
3 | 99401899999 | 2022-01-01T03:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 280,1,N,0046,1 | 99999,9,9,N | 999999,9,9,9 | +0131,1 | +9999,9 | 10087,1 | 3,1,014,1,+999,9 | 5,99,0067,1,999 | +152,1 | SYN080LJAC1 46/// /2809 10131 40087 53014 9030... |
4 | 99401899999 | 2022-01-01T04:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 250,1,N,0057,1 | 99999,9,9,N | 999999,9,9,9 | +0133,1 | +9999,9 | 10094,1 | 2,1,020,1,+999,9 | 5,99,0062,1,999 | +153,1 | SYN080LJAC1 46/// /2511 10133 40094 52020 9040... |
weather_df['datetime'] = pd.to_datetime(weather_df['DATE'])
weather_df.drop(columns=['DATE'], inplace=True)
weather_df.drop(columns=['STATION','SOURCE', 'LATITUDE', 'LONGITUDE', 'ELEVATION','CIG','VIS','DEW',
                         'NAME', 'CALL_SIGN','QUALITY_CONTROL','REPORT_TYPE','SLP','MD1','OD1','REM'],
                inplace=True)
Drop rows where the Temperature column holds the N/A sentinel value (+9999,9).
weather_df = weather_df[weather_df['TMP'] != "+9999,9"]
Transform raw data to readable format.
weather_df['TMP'] = weather_df['TMP'].str.replace(',1', '').astype(float)
weather_df['TMP'] = weather_df['TMP'] / 10

weather_df['SA1'] = weather_df['SA1'].str.replace(',1', '').astype(float)
weather_df['SA1'] = weather_df['SA1'] / 10

weather_df = weather_df[['datetime', 'TMP', 'SA1', 'WND']]
weather_df = weather_df.rename(columns={'TMP': 'Dry_Bulb_Temp', 'SA1': 'Sea_Surface_Temp', 'WND': 'Wind'})

weather_df[['Wind_Degree', 'Wind_Speed']] = weather_df['Wind'].str.split(',', expand=True).iloc[:, [0, 3]]
weather_df['Wind_Degree'] = weather_df['Wind_Degree'].astype(float)
weather_df['Wind_Speed'] = weather_df['Wind_Speed'].astype(float)
weather_df['Wind_Speed'] /= 10
weather_df.drop(columns=['Wind'], inplace=True)
def degrees_to_cardinal(degrees):
    if degrees == 999:
        return np.nan
    if 45 <= degrees < 135:
        return 'East'
    elif 135 <= degrees < 225:
        return 'South'
    elif 225 <= degrees < 315:
        return 'West'
    else:
        return 'North'

weather_df['Wind_Direction'] = weather_df['Wind_Degree'].apply(degrees_to_cardinal)
Here’s the weather dataset after cleaning:
weather_df.head()
 | datetime | Dry_Bulb_Temp | Sea_Surface_Temp | Wind_Degree | Wind_Speed | Wind_Direction |
---|---|---|---|---|---|---|
0 | 2022-01-01 00:00:00 | 13.5 | 15.4 | 290.0 | 8.2 | West |
1 | 2022-01-01 01:00:00 | 13.4 | 15.3 | 270.0 | 6.2 | West |
2 | 2022-01-01 02:00:00 | 13.2 | 15.2 | 280.0 | 5.7 | West |
3 | 2022-01-01 03:00:00 | 13.1 | 15.2 | 280.0 | 4.6 | West |
4 | 2022-01-01 04:00:00 | 13.3 | 15.3 | 250.0 | 5.7 | West |
Sort the Geisel dataset by descending values of the In count column.
Geisel_Merged_df.sort_values(by='In', ascending=False).head()
 | datetime | In | Out | In_Out_Diff |
---|---|---|---|---|
3924 | 2023-05-10 10:00:00 | 79637 | 371206 | -291569 |
1693 | 2023-01-16 06:00:00 | 20175 | 13040 | 7135 |
1692 | 2023-01-16 04:00:00 | 14721 | 12762 | 1959 |
1691 | 2023-01-15 03:00:00 | 2629 | 14846 | -12217 |
168 | 2022-09-27 13:00:00 | 1510 | 1269 | 241 |
We dropped indices 3924, 1693, 1692, and 1691 because we identified these data points as outliers: it is highly unlikely that Geisel Library had over 10,000 people entering or leaving in a single hour. Geisel is pretty huge, but it's still not that big!
These data points also did not match the trends in the timestamps immediately before and after them. We concluded they were generated by a system failure and decided to drop them.
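As a sanity check, here is a minimal sketch (our own addition, not part of the original pipeline) of how such spikes could be flagged automatically with a loose IQR fence instead of eyeballing the sorted table; the fence multiplier is an arbitrary choice.

# Flag hours whose In or Out count sits far above a loose IQR fence
q1, q3 = Geisel_Merged_df['In'].quantile([0.25, 0.75])
upper_fence = q3 + 10 * (q3 - q1)          # multiplier of 10 chosen arbitrarily for extreme glitches only
suspect_rows = Geisel_Merged_df[(Geisel_Merged_df['In'] > upper_fence) | (Geisel_Merged_df['Out'] > upper_fence)]
print(suspect_rows)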
Geisel_Merged_df = Geisel_Merged_df.drop([3924, 1693, 1692, 1691])  # Drop outlier indices
We first examined the distributions of the In and Out counts and their difference, In_Out_Diff.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram for 'In'
sns.histplot(Geisel_Merged_df['In'], bins=30, ax=axes[0], kde=True)
axes[0].set_title('Distribution of In Counts', fontsize=20)

# Histogram for 'Out'
sns.histplot(Geisel_Merged_df['Out'], bins=30, ax=axes[1], kde=True)
axes[1].set_title('Distribution of Out Counts', fontsize=20)

# Histogram for 'In_Out_Diff'
sns.histplot(Geisel_Merged_df['In_Out_Diff'], bins=30, ax=axes[2], kde=True)
axes[2].set_title('Distribution of In-Out Difference', fontsize=20)

plt.tight_layout()
plt.show()
Here are the descriptive statistics:
descriptive_stats = Geisel_Merged_df[['In', 'Out', 'In_Out_Diff']].describe()
print(descriptive_stats)
In Out In_Out_Diff
count 4238.000000 4238.000000 4238.000000
mean 259.138745 283.027607 -23.888863
std 262.288989 275.191358 144.493022
min 0.000000 0.000000 -1072.000000
25% 29.250000 51.000000 -103.000000
50% 171.500000 188.000000 -13.000000
75% 425.000000 456.000000 69.000000
max 1510.000000 1433.000000 388.000000
Some preprocessing for later EDA.
def assign_quarter(x):
    quarters = {
        'Fall': {'start': '2022-09-19', 'end': '2022-12-10'},
        'Winter': {'start': '2023-01-04', 'end': '2023-03-15'},
        'Spring': {'start': '2023-03-29', 'end': '2023-06-16'}
    }
    x = pd.Timestamp(x)
    for quarter, dates in quarters.items():
        if pd.to_datetime(dates['start']) <= x <= pd.to_datetime(dates['end']):
            return quarter

Geisel_Merged_EDA_df = Geisel_Merged_df.copy()
Geisel_Merged_EDA_df['hour'] = Geisel_Merged_EDA_df['datetime'].dt.hour
Geisel_Merged_EDA_df['date'] = Geisel_Merged_EDA_df['datetime'].dt.date
Geisel_Merged_EDA_df['month'] = Geisel_Merged_EDA_df['datetime'].dt.month
Geisel_Merged_EDA_df['weekday'] = Geisel_Merged_EDA_df['datetime'].dt.weekday
Geisel_Merged_EDA_df['quarter'] = Geisel_Merged_EDA_df['datetime'].apply(lambda x: assign_quarter(x))
sns.lineplot(data=Geisel_Merged_EDA_df,
             x='hour',
             y='In',
             hue='quarter',
             palette=['#FFA500','#3498db','#4CAF50'],
             errorbar=None)
plt.title('Hourly Trend of IN', fontsize=16)
plt.xlabel('Hour')
plt.ylabel('IN count')
plt.show()
Geisel Library opens at 7:30 AM, and we can see an obvious increase around that time. However, since Geisel also has an overnight study area, the curves are quite smooth and we don't see any sharp jump.
sns.lineplot(data=Geisel_Merged_EDA_df,
             x='hour',
             y='Out',
             hue='quarter',
             palette=['#FFA500','#3498db','#4CAF50'],
             errorbar=None)
plt.title('Hourly Trend of OUT', fontsize=16)
plt.xlabel('Hour')
plt.ylabel('OUT count')
plt.show()
Geisel Library closes at 10:00 PM on most days, and we can see an obvious decrease around that time. However, since Geisel also has an overnight study area, the curves are quite smooth and we don't see any sharp drop.
First, we merged the hourly data into daily data for annual analysis.
Geisel_date_df = Geisel_Merged_EDA_df.groupby('date')[['In', 'Out', 'weekday']].median().reset_index()
Geisel_date_df['quarter'] = Geisel_date_df['date'].apply(lambda x: assign_quarter(x))

# Create the line plot with dots
plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_date_df,
             x='date',
             y='In',
             hue='quarter',
             markers=True,
             marker='o',
             markersize=6,
             linestyle='-',
             linewidth=2,
             palette=['#FFA500','#3498db','#4CAF50'])

# Customize plot details
plt.title('Daily In Trends of Geisel Library Across Quarters', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Counts')
plt.xticks(rotation=45)
plt.legend(title='Quarter', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()

# Show the plot
plt.show()
Since Geisel Library opens seven days a week except for some holidays, we do not need any further data cleaning regarding Geisel's library hours. From the graph, we can see some obvious peaks every quarter; these are probably the midterm weeks, when more students go to Geisel to study for exams. Among the three quarters, Fall quarter has the highest counts of people entering the library.
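To probe the midterm-week guess, here is a small sketch (our own addition) that lists the five busiest days per quarter from the daily medians computed above; the cutoff of five is arbitrary.

# List the five days with the highest median In count in each quarter
top_days = (Geisel_date_df
            .sort_values('In', ascending=False)
            .groupby('quarter')
            .head(5)
            .sort_values(['quarter', 'date']))
print(top_days[['quarter', 'date', 'In']])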
Geisel_fall_df = Geisel_Merged_EDA_df[Geisel_Merged_EDA_df['quarter'] == 'Fall'].copy()
Geisel_winter_df = Geisel_Merged_EDA_df[Geisel_Merged_EDA_df['quarter'] == 'Winter'].copy()
Geisel_spring_df = Geisel_Merged_EDA_df[Geisel_Merged_EDA_df['quarter'] == 'Spring'].copy()
Geisel_fall_df['date'] = Geisel_fall_df['datetime'].dt.date
Geisel_daily_fall_data = Geisel_fall_df.groupby('date').agg({'In': 'median', 'Out': 'median'}).reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_daily_fall_data, x='date', y='In', label='In', marker='o')
sns.lineplot(data=Geisel_daily_fall_data, x='date', y='Out', label='Out', marker='o')

week_range = pd.date_range('2022-09-19', '2022-12-10', freq='W-MON')

plt.title('Fall Quarter Daily In Trends of Geisel Library', fontsize=20)
plt.xlabel('Date (Monday as labels)')
plt.ylabel('Count')
plt.xticks(week_range, rotation=45)
plt.show()
At the beginning of the school year, there is a peak in counts, probably because new students all want to take a look at the famous Geisel Library. There is also an obvious dip in late November, which suggests only a few people enter Geisel Library during Thanksgiving break.
Geisel_winter_df['date'] = Geisel_winter_df['datetime'].dt.date
Geisel_daily_winter_data = Geisel_winter_df.groupby('date').agg({'In': 'median', 'Out': 'median'}).reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_daily_winter_data, x='date', y='In', label='In', marker='o')
sns.lineplot(data=Geisel_daily_winter_data, x='date', y='Out', label='Out', marker='o')

week_range = pd.date_range('2023-01-04', '2023-03-15', freq='W-MON')

plt.title('Winter Quarter Daily In Trends of Geisel Library', fontsize=20)
plt.xlabel('Date (Monday as labels)')
plt.ylabel('Count')
plt.xticks(week_range, rotation=45)
plt.show()
The data from Winter quarter is very periodic: many people entered Geisel Library in the middle of the week, while only a few entered on weekends.
Geisel_spring_df['date'] = Geisel_spring_df['datetime'].dt.date
Geisel_daily_spring_data = Geisel_spring_df.groupby('date').agg({'In': 'median', 'Out': 'median'}).reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_daily_spring_data, x='date', y='In', label='In', marker='o')
sns.lineplot(data=Geisel_daily_spring_data, x='date', y='Out', label='Out', marker='o')

week_range = pd.date_range('2023-03-29', '2023-06-16', freq='W-MON')

plt.title('Spring Quarter Daily In Trends of Geisel Library', fontsize=20)
plt.xlabel('Date (Monday as labels)')
plt.ylabel('Count')
plt.xticks(week_range, rotation=45)
plt.show()
Around 2023-04-10, there is a sharp increase in the number of people leaving Geisel Library, but the number of people entering is not comparable, so we guess the counter was down on that day. Throughout the quarter, there is a slow increase in the student occupancy flow entering and leaving Geisel Library.
Sort the Wong Avery dataset by descending values of the In count column.
WongAvery_Merged_df.sort_values(by='In', ascending=False).head()
 | datetime | In | Out | In_Out_Diff |
---|---|---|---|---|
986 | 2023-02-28 10:00:00 | 28406 | 29227 | -821 |
1804 | 2023-05-24 09:00:00 | 6790 | 6323 | 467 |
1509 | 2023-04-17 17:00:00 | 204 | 29 | 175 |
988 | 2023-02-28 12:00:00 | 167 | 84 | 83 |
991 | 2023-02-28 15:00:00 | 165 | 130 | 35 |
We dropped indices 986 and 1804 because we identified these data points as outliers: it is highly unlikely that Wong Avery Library had over 5,000 people coming in within one hour. This is probably the result of machine failures.
WongAvery_Merged_df = WongAvery_Merged_df.drop([986, 1804])  # Drop outlier indices
We first examined the distributions of the In and Out counts and their difference, In_Out_Diff.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram for 'In'
sns.histplot(WongAvery_Merged_df['In'], bins=30, ax=axes[0], kde=True)
axes[0].set_title('Distribution of In Counts', fontsize=20)

# Histogram for 'Out'
sns.histplot(WongAvery_Merged_df['Out'], bins=30, ax=axes[1], kde=True)
axes[1].set_title('Distribution of Out Counts', fontsize=20)

# Histogram for 'In_Out_Diff'
sns.histplot(WongAvery_Merged_df['In_Out_Diff'], bins=30, ax=axes[2], kde=True)
axes[2].set_title('Distribution of In-Out Difference', fontsize=20)

plt.tight_layout()
plt.show()
Here are the descriptive statistics of the dataset:
descriptive_stats = WongAvery_Merged_df[['In', 'Out', 'In_Out_Diff']].describe()
print(descriptive_stats)
In Out In_Out_Diff
count 1803.000000 1803.000000 1803.000000
mean 55.387133 54.675541 0.711592
std 46.375260 45.069469 44.025394
min 0.000000 0.000000 -217.000000
25% 9.000000 8.000000 -16.000000
50% 47.000000 49.000000 2.000000
75% 97.000000 92.000000 26.000000
max 204.000000 254.000000 175.000000
We merged the hourly data into daily data for annual analysis.
WongAvery_Merged_EDA_df = WongAvery_Merged_df.copy()
WongAvery_Merged_EDA_df['date'] = WongAvery_Merged_EDA_df['datetime'].dt.date

date_df = WongAvery_Merged_EDA_df.groupby('date')[['In', 'Out']].median().reset_index()
date_df['quarter'] = date_df['date'].apply(lambda x: assign_quarter(x))
# Create the line plot with dots
plt.figure(figsize=(12, 6))
sns.lineplot(data=date_df,
             x='date',
             y='In',
             hue='quarter',
             markers=True,
             marker='o',
             markersize=6,
             linestyle='-',
             linewidth=2,
             palette=['#FFA500','#3498db','#4CAF50'])

# Customize plot details
plt.title('Daily In Trends of WongAvery Library Across Quarters', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Counts')
plt.xticks(rotation=45)
plt.legend(title='Quarter', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()

# Show the plot
plt.show()
Unfortunately, we observed a significant proportion of missing data during Winter quarter. We have to discard Wong Avery Library from further EDA and analysis 😢.
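To get a rough sense of how much is missing, here is a small sketch (our own addition) that counts, per quarter, the in-session days with no Wong Avery gate records at all.

# Count in-session days that have no Wong Avery records, per quarter
expected_days = pd.Series(
    list(pd.date_range('2022-09-19', '2022-12-10').date) +
    list(pd.date_range('2023-01-04', '2023-03-15').date) +
    list(pd.date_range('2023-03-29', '2023-06-16').date)
)
observed_days = set(WongAvery_Merged_EDA_df['date'])
missing_days = expected_days[~expected_days.isin(observed_days)]
print(missing_days.map(assign_quarter).value_counts())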
Preprocessing data.
weather_df['Quarter'] = weather_df['datetime'].apply(lambda x: assign_quarter(x))

start_date = pd.to_datetime('2022-09-19')
end_date = pd.to_datetime('2023-06-16')
weather_df = weather_df[(weather_df['datetime'] >= start_date) & (weather_df['datetime'] <= end_date)]
# Filter out the missing-value sentinels (999.9 speed, 999 degree) before plotting
wind_df = weather_df[weather_df['Wind_Speed'] <= 998]
wind_df = wind_df[wind_df['Wind_Degree'] <= 998]
Daily wind speed throughout the year:
daily_mean_wind_speed = wind_df.groupby(wind_df["datetime"].dt.date)["Wind_Speed"].mean()

plt.figure(figsize=(17, 7))
plt.plot(daily_mean_wind_speed)
plt.title('Daily Mean Wind Speed', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Mean Wind Speed')
plt.show()
Not much to tell from this graph. We guess it's just randomly windy in San Diego.
Quarterly wind speed:
quarterly_mean_wind_speed = weather_df.groupby('Quarter')['Wind_Speed'].mean()

plt.figure(figsize=(6, 6))

# Plot using Seaborn barplot
sns.barplot(x=quarterly_mean_wind_speed.index,
            hue=quarterly_mean_wind_speed.index,
            y=quarterly_mean_wind_speed,
            order=['Fall','Winter','Spring'],
            palette=['#FFA500', '#4CAF50', '#3498db'],
            width=0.5)

# Set titles and labels
plt.title('Mean Quarter Wind Speed', fontsize=20)
plt.xlabel('Quarter')
plt.ylabel('Mean Wind Speed')

# Rotate x-axis labels
plt.xticks(rotation=0)
plt.show()
It turns out that, overall, Spring quarter is the windiest on average!
Daily temperature trend
daily_mean_dry_bulb = weather_df.groupby(weather_df["datetime"].dt.date)["Dry_Bulb_Temp"].mean()
daily_mean_sea_surface = weather_df.groupby(weather_df["datetime"].dt.date)["Sea_Surface_Temp"].mean()

sns.set(style="whitegrid")

plt.figure(figsize=(17, 7))

# Plot using Seaborn lineplot
sns.lineplot(x=daily_mean_dry_bulb.index, y=daily_mean_dry_bulb, label='Dry Bulb Temp', color='#4CAF50')
sns.lineplot(x=daily_mean_sea_surface.index, y=daily_mean_sea_surface, label='Sea Surface Temp', color='#3498db')

plt.title('Daily Mean Dry Bulb Temperature and Sea Surface Temperature', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')

plt.legend()
plt.show()
As expected, the temperature decreases from the start of Fall quarter to the end of Winter quarter. It seems UCSD did decent research on how to split up the quarters. It is also quite interesting that although the sea surface temperature follows the same trend as the dry-bulb temperature, it is much smoother. This is reasonable, because the sea stores heat during the day, making it less vulnerable to sudden temperature changes.
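To back up the "smoother" observation with a number, here is a small sketch (our own addition): the mean absolute day-to-day change of each series, where a smaller value means a smoother curve.

# Mean absolute day-to-day change for each temperature series (smaller = smoother)
print(pd.DataFrame({
    'Dry_Bulb_Temp': daily_mean_dry_bulb.diff().abs(),
    'Sea_Surface_Temp': daily_mean_sea_surface.diff().abs(),
}).mean())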
Quarterly Mean Temperature
quarterly_mean_temp = weather_df.groupby('Quarter')[['Dry_Bulb_Temp','Sea_Surface_Temp']].mean()
quarterly_mean_temp = quarterly_mean_temp.reset_index().melt(id_vars='Quarter', var_name='Temperature', value_name='Value')

plt.figure(figsize=(8, 6))

sns.barplot(data=quarterly_mean_temp,
            x='Quarter',
            y='Value',
            hue='Temperature',
            palette=['#4CAF50', '#3498db'],
            order=['Fall','Winter','Spring'],
            width=0.8)

plt.title('Mean Temperatures by Quarter', fontsize=20)
plt.xlabel('Quarter')
plt.ylabel('Temperature (°C)')

plt.legend()
plt.show()
Merge the two datasets based on their timestamps.
weather_df['Wind_Degree'] = weather_df['Wind_Degree'].replace(999.0, np.nan)
weather_df['Wind_Speed'] = weather_df['Wind_Speed'].replace(999.9, np.nan)

merged_df = pd.merge(Geisel_Merged_df, weather_df, on='datetime', how='left')
Check NaN values:
merged_df[merged_df.isnull().any(axis=1)].head(8)
 | datetime | In | Out | In_Out_Diff | Dry_Bulb_Temp | Sea_Surface_Temp | Wind_Degree | Wind_Speed | Wind_Direction | Quarter |
---|---|---|---|---|---|---|---|---|---|---|
21 | 2022-09-20 12:00:00 | 340 | 334 | 6 | 18.9 | 20.4 | NaN | 0.0 | NaN | Fall |
23 | 2022-09-20 14:00:00 | 289 | 282 | 7 | 19.2 | 21.5 | NaN | 0.0 | NaN | Fall |
50 | 2022-09-22 11:00:00 | 830 | 689 | 141 | 18.5 | 20.6 | NaN | 0.0 | NaN | Fall |
52 | 2022-09-22 13:00:00 | 1027 | 982 | 45 | 17.6 | 19.4 | NaN | 0.0 | NaN | Fall |
63 | 2022-09-23 06:00:00 | 11 | 4 | 7 | 20.3 | 19.5 | NaN | 0.0 | NaN | Fall |
65 | 2022-09-23 08:00:00 | 307 | 105 | 202 | 20.1 | 19.0 | NaN | 0.0 | NaN | Fall |
69 | 2022-09-23 12:00:00 | 760 | 794 | -34 | 19.5 | 20.7 | NaN | 0.0 | NaN | Fall |
70 | 2022-09-23 13:00:00 | 837 | 820 | 17 | 19.5 | 19.7 | NaN | 0.0 | NaN | Fall |
There are missing values in the weather dataset, and most of them come from the wind-related columns. We will drop those columns as they are irrelevant to this study. We will also drop Sea_Surface_Temp, as it is not that relevant to UCSD students (we are not fish~). In_Out_Diff is also dropped.
merged_df.drop(columns=['Wind_Degree','Wind_Speed','Wind_Direction','In_Out_Diff','Sea_Surface_Temp'], inplace=True)
merged_df.rename(columns={'Dry_Bulb_Temp': 'Temperature'}, inplace=True)
Now the merged dataset looks like this:
merged_df.head()
 | datetime | In | Out | Temperature | Quarter |
---|---|---|---|---|---|
0 | 2022-09-19 06:00:00 | 2 | 3 | 18.7 | Fall |
1 | 2022-09-19 07:00:00 | 42 | 8 | 18.8 | Fall |
2 | 2022-09-19 08:00:00 | 110 | 22 | 18.1 | Fall |
3 | 2022-09-19 09:00:00 | 135 | 60 | 18.4 | Fall |
4 | 2022-09-19 10:00:00 | 180 | 116 | 18.3 | Fall |
Since we observed that Geisel is significantly less populated during weekends, we will drop weekend dates for better correlations. We also dropped nighttime data (11 PM to 8 AM): though Geisel has overnight study spaces, the top floors are all closed, and the sample population would be inconsistent if we kept the night data.
merged_df = merged_df[(merged_df['datetime'].dt.weekday < 5)]
merged_df = merged_df[(merged_df['datetime'].dt.hour >= 8) & (merged_df['datetime'].dt.hour <= 22)]
Let’s now try to see if we can observe any relationship between temperature and occupancy!
fig = pd.plotting.scatter_matrix(frame=merged_df[['In', 'Out', 'Temperature']],
                                 figsize=(10, 10), alpha=0.5)
plt.grid(False)
plt.suptitle('Scatter Matrix in Hourly Granularity', fontsize=20)
plt.show()
There is a clear linear relationship between In and Out. This makes sense: when a lot of people are coming into the library, there should also be many coming out. It is disappointing, though, to see such a mess in the Temperature column. Note that we are observing at an hourly granularity; if we switch to a daily granularity, things may change.
# Extract date without time
merged_df['date'] = merged_df['datetime'].dt.date

median_columns = ['In', 'Out', 'Temperature']
keep_columns = ['Quarter']

# Group by date and calculate the median for specified columns
daily_df = merged_df.groupby('date').agg({**{col: 'median' for col in median_columns}, **{col: 'first' for col in keep_columns}}).reset_index()
fig = pd.plotting.scatter_matrix(frame=daily_df[['In', 'Out', 'Temperature']],
                                 figsize=(10, 10), alpha=0.5)
plt.grid(False)
plt.suptitle('Scatter Matrix in Daily Granularity', fontsize=20)
plt.show()
Take a closer look at the daily in/out flow and temperature.
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.regplot(data=daily_df, x='In', y='Temperature', ax=axes[0], scatter_kws={'s': 100})
sns.regplot(data=daily_df, x='Out', y='Temperature', ax=axes[1], scatter_kws={'s': 100, 'color': sns.color_palette()[1]})

axes[0].set_title('IN Counts', fontsize=20)
axes[1].set_title('OUT Counts', fontsize=20)

plt.suptitle('Whole Year Relationship between Temperature and Occupancy', fontsize=24)
plt.show()
When combining all quarters, the relationship is somewhat weak, as many values fall outside the confidence interval. Maybe things will look better when we split the data up by quarter.
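To put a number on how weak or strong this is, a quick sketch (our own addition, with the resulting values unverified here) computes the Pearson correlations between the daily medians.

# Pearson correlation matrix of the daily medians
print(daily_df[['In', 'Out', 'Temperature']].corr())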
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Filter data for each quarter
fall_df = daily_df[daily_df['Quarter'] == 'Fall']
winter_df = daily_df[daily_df['Quarter'] == 'Winter']
spring_df = daily_df[daily_df['Quarter'] == 'Spring']

palette = ['#FFA500', '#4CAF50', '#3498db']
# Create regression plots for the first row
sns.regplot(data=fall_df, x='In', y='Temperature', ax=axes[0, 0], scatter_kws={'s': 100, 'color': '#FFA500'})
sns.regplot(data=winter_df, x='In', y='Temperature', ax=axes[0, 1], scatter_kws={'s': 100, 'color': '#3498db'})
sns.regplot(data=spring_df, x='In', y='Temperature', ax=axes[0, 2], scatter_kws={'s': 100, 'color': '#4CAF50'})

sns.regplot(data=fall_df, x='Out', y='Temperature', ax=axes[1, 0], scatter_kws={'s': 100, 'color': '#FFA500'})
sns.regplot(data=winter_df, x='Out', y='Temperature', ax=axes[1, 1], scatter_kws={'s': 100, 'color': '#3498db'})
sns.regplot(data=spring_df, x='Out', y='Temperature', ax=axes[1, 2], scatter_kws={'s': 100, 'color': '#4CAF50'})
# Set titles for the first row
plt.suptitle('Quarterly Relationship between Temperature and Occupancy', fontsize=24)
axes[0, 0].set_title('Fall', fontsize=20)
axes[0, 1].set_title('Winter', fontsize=20)
axes[0, 2].set_title('Spring', fontsize=20)

plt.show()
From the graph above, it seems there is a weak linear relationship between temperature and the In/Out counts during the Fall and Spring quarters. During Winter quarter, people just keep coming to Geisel regardless of temperature (Let's go Tritons!).
To some extent, this shows that as temperature rises, more people visit Geisel, which supports our hypothesis. But we have to admit that there are outliers in both the Fall and Spring quarters, not to mention Winter quarter.
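To quantify this quarter by quarter, here is a short sketch (again our own addition, with values unverified) that prints the per-quarter correlation between temperature and the daily In/Out counts.

# Per-quarter Pearson correlations between temperature and daily In/Out counts
for quarter, grp in daily_df.groupby('Quarter'):
    print(quarter,
          'In vs Temperature r =', round(grp['In'].corr(grp['Temperature']), 2),
          'Out vs Temperature r =', round(grp['Out'].corr(grp['Temperature']), 2))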
We conclude that weather may contribute to people's decisions about whether to go to the libraries, but many other factors may also play a role. It is also good to clear up the misconception that Geisel is especially crowded during finals week: it's always crowded!
Informed consent: We believe the data our team collected, both the weather data and the traffic accident data, come from a trusted public website, the official San Diego government website.
Collection bias: The weather and traffic accident data are all observed events, and the data collected are focused on facts that happened rather than subjective observation. This factual focus helps us reduce collection bias.
Limit PII exposure: Our data, though collected from various people and events, do not contain much personally identifiable information; there is no obvious part of our data that could lead to the identification of a specific individual.
Downstream bias mitigation: The data we are using were collected comprehensively across all traffic accidents regardless of race, gender, or religious background.
Data security: Since the data are publicly accessible, our job in keeping our own data safe is to make our repo private and only grant access to relevant individuals.
Right to be forgotten: We'll open a contact portal to the public so that anyone who wishes to have their data removed has the freedom to do so.
Data retention plan: We plan on deleting the data we've collected after week 10.
Missing perspectives: We plan on talking to the TA and possibly the professor about our assumptions to gather another point of view.
Dataset bias: The dataset is collected by the San Diego government through factual events, e.g. weather, temperature, and the number and type of traffic accidents that happened. In conclusion, we think our dataset is mostly bias free, but we cannot rule out bias created by the observer of an event, e.g. a police officer's view of an accident.
Honest representation: We will try our best to represent the data without any bias.
Privacy in analysis: We will ensure that data with PII are not used or displayed unless necessary for the analysis.
Auditability: We will ensure that any analysis we conduct is reasonable and reproducible.