import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
Do you still head to Geisel when it's freezing cold at 8 AM? We tried to find out whether there is a relationship between weather conditions and the occupancy rate of UCSD libraries. Using a dataset covering every gate counter in the Geisel and Wong Avery libraries, we obtained an authoritative representation of occupancy at an hourly granularity.
After some exploratory data analysis, we realized that not only do people visit the libraries less on weekends, but Fridays also see lower occupancy throughout the year. Surprisingly, the overall occupancy did not fluctuate much over the course of a quarter, and the libraries are not especially crowded during finals week.
Finally, we combined Geisel's data with a weather dataset from NOAA. The results show a moderate relationship between temperature and the in/out traffic of Geisel Library.
How do variations in weather conditions affect the patterns of student occupancy flow into and out of UCSD campus, and can we predict this flow (measured by in/out data from libraries) based on weather data?
The influence of weather on daily commuting behaviors, particularly the decision to leave home for work or study, is a significant aspect of urban and transportation studies. Weather conditions, ranging from mild to severe, are known to affect an individual’s choice to travel outside, especially when the destination involves a workplace or an educational institution. This is exemplified by studies that have shown a decrease in outdoor activities, such as exercise and recreation, during adverse weather conditions [1].
Additionally, research using mobile phone GPS data indicates that weather parameters like temperature and rainfall influence daily activity patterns, affecting decisions regarding destinations and time spent outside [2]. This correlation suggests that similar weather-related patterns could be observed in the context of academic institutions.
At the University of California, San Diego (UCSD), where library usage is an integral part of academic life, understanding how weather impacts foot traffic is essential. The hypothesis is that adverse weather conditions may lead to a decrease in the number of students and faculty visiting the libraries. However, this effect might be counteracted during periods of high academic demand, such as midterms and finals week, when library resources are in greater demand regardless of weather conditions. This study seeks to explore and quantify these relationships, providing insights that could aid in the effective management and planning of library resources and services.
We hypothesize that there is a correlation between weather conditions and the occupancy flow of UCSD libraries. Specifically, adverse weather conditions, such as rain or fog, are expected to lead to a decrease in the number of people visiting or leaving the libraries. We also expect that during academic events such as midterm and finals weeks, library traffic will increase regardless of weather conditions.
We got this dataset from UCSD librarian Deborah Kegel, who generously provided the gate count information for the Geisel and Wong Avery libraries. It records the number of people coming in and out through the library security gates at an hourly granularity. Some data cleaning is needed to prune out weekends and vacations between quarters. We also grouped the gate counts to generate the total in/out for each timestamp.
After cleaning, we kept only three features: [In, Out, In_Out_Diff]
This dataset is from NOAA, the most authoritative source for weather data. We downloaded data from the La Jolla station, as it is the closest to UCSD. There are a lot of features in this dataset, so we dropped many that we do not need. Many conversions are needed because the data is logged in an unusual format, see here. We also transformed the wind degree into the four major directions for simplicity.
After cleaning, we kept only five features: [Dry_Bulb_Temp, Sea_Surface_Temp, Wind_Degree, Wind_Speed, Wind_Direction]
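To give a feel for that unusual format, here is a minimal sketch (our own illustration, using the TMP value from the first row of the raw table shown later) of how one encoded field decodes: the reading and a quality flag are packed into a single comma-separated string, and the reading is scaled by a factor of 10.

# Illustrative decode of one raw NOAA field; "+0135,1" is the example value from the raw table below
raw_tmp = "+0135,1"                 # dry-bulb temperature field: <scaled value>,<quality code>
value, quality = raw_tmp.split(",")
print(float(value) / 10)            # -> 13.5 (degrees Celsius)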
We transformed the time column of both datasets to datetime format; to combine them, we simply call the pandas merge function on the timestamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
lib_df = pd.read_csv("Gate Counters.csv")
lib_df['datetime'] = pd.to_datetime(lib_df['Date'] + ' ' + lib_df['Time'])
lib_df.drop(columns=['Date', 'Time'], inplace=True)
lib_df = lib_df[['datetime','Branch','Group','Security Gate','In','Out']]
lib_df['In_Out_Diff'] = lib_df['In'] - lib_df['Out']
Raw combined library dataset's format:
lib_df.head()
 | datetime | Branch | Group | Security Gate | In | Out | In_Out_Diff |
---|---|---|---|---|---|---|---|
0 | 2022-09-15 05:00:00 | Geisel | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 1 | 0 | 1 |
1 | 2022-09-15 05:00:00 | Geisel | Geisel - East Entrance | FEIG Gate 2nd Counter on Bus 1 | 2 | 0 | 2 |
2 | 2022-09-15 05:00:00 | Geisel | Geisel - West Entrance | FEIG Gate 1st Counter on Bus 1 | 0 | 0 | 0 |
3 | 2022-09-15 05:00:00 | Geisel | Geisel - West Entrance | FEIG Gate 2nd Counter on Bus 1 | 0 | 1 | -1 |
4 | 2022-09-15 06:00:00 | Geisel | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 13 | 14 | -1 |
Geisel_df = lib_df[lib_df['Branch'] == 'Geisel'].drop(columns=['Branch'])
WongAvery_df = lib_df[lib_df['Branch'] == 'WongAvery'].drop(columns=['Branch'])
Raw Geisel dataset’s format:
Geisel_df.head()
 | datetime | Group | Security Gate | In | Out | In_Out_Diff |
---|---|---|---|---|---|---|
0 | 2022-09-15 05:00:00 | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 1 | 0 | 1 |
1 | 2022-09-15 05:00:00 | Geisel - East Entrance | FEIG Gate 2nd Counter on Bus 1 | 2 | 0 | 2 |
2 | 2022-09-15 05:00:00 | Geisel - West Entrance | FEIG Gate 1st Counter on Bus 1 | 0 | 0 | 0 |
3 | 2022-09-15 05:00:00 | Geisel - West Entrance | FEIG Gate 2nd Counter on Bus 1 | 0 | 1 | -1 |
4 | 2022-09-15 06:00:00 | Geisel - East Entrance | FEIG Gate 1st Counter on Bus 1 | 13 | 14 | -1 |
Summing over all gate counters to obtain the total in/out values per timestamp.
Geisel_Merged_df = Geisel_df.groupby(['datetime', 'Security Gate']).agg({'In': 'sum', 'Out': 'sum'}).reset_index()
Geisel_Merged_df = Geisel_Merged_df.groupby('datetime').agg({'In': 'sum', 'Out': 'sum'}).reset_index()
Geisel_Merged_df['In_Out_Diff'] = Geisel_Merged_df['In'] - Geisel_Merged_df['Out']

WongAvery_Merged_df = WongAvery_df.groupby(['datetime', 'Security Gate']).agg({'In': 'sum', 'Out': 'sum'}).reset_index()
WongAvery_Merged_df = WongAvery_Merged_df.groupby('datetime').agg({'In': 'sum', 'Out': 'sum'}).reset_index()
WongAvery_Merged_df['In_Out_Diff'] = WongAvery_Merged_df['In'] - WongAvery_Merged_df['Out']
Since there is almost no one in the library during breaks, we neglect those days. The days we keep fall within the 2022-2023 academic year:
Fall: 2022-09-19 to 2022-12-10
Winter: 2023-01-04 to 2023-03-15
Spring: 2023-03-29 to 2023-06-16
def keepQuarterDay(dataFrame: pd.DataFrame) -> pd.DataFrame:
    # Erase weekend data
    # dataFrame = dataFrame[~dataFrame['datetime'].dt.weekday.isin([5, 6])]
    # Quarter days filter
    fall_range = (dataFrame['datetime'] >= '2022-09-19') & (dataFrame['datetime'] <= '2022-12-10')
    winter_range = (dataFrame['datetime'] >= '2023-01-04') & (dataFrame['datetime'] <= '2023-03-15')
    spring_range = (dataFrame['datetime'] >= '2023-03-29') & (dataFrame['datetime'] <= '2023-06-16')

    return dataFrame[fall_range | winter_range | spring_range]

Geisel_Merged_df = keepQuarterDay(Geisel_Merged_df)
WongAvery_Merged_df = keepQuarterDay(WongAvery_Merged_df)
Geisel_df = keepQuarterDay(Geisel_df)
WongAvery_df = keepQuarterDay(WongAvery_df)
Geisel dataset after cleaning:
Geisel_Merged_df.head()
 | datetime | In | Out | In_Out_Diff |
---|---|---|---|---|
30 | 2022-09-19 06:00:00 | 2 | 3 | -1 |
31 | 2022-09-19 07:00:00 | 42 | 8 | 34 |
32 | 2022-09-19 08:00:00 | 110 | 22 | 88 |
33 | 2022-09-19 09:00:00 | 135 | 60 | 75 |
34 | 2022-09-19 10:00:00 | 180 | 116 | 64 |
Documentation of the dataset: Link.
After merging 2022 and 2023 data, here’s the raw form of the weather dataset:
weather_2022 = pd.read_csv("La Jolla 2022.csv")
weather_2023 = pd.read_csv("La Jolla 2023.csv")
weather_df = pd.concat([weather_2022, weather_2023], ignore_index=True)
weather_df.head()
 | STATION | DATE | SOURCE | LATITUDE | LONGITUDE | ELEVATION | NAME | REPORT_TYPE | CALL_SIGN | QUALITY_CONTROL | WND | CIG | VIS | TMP | DEW | SLP | MD1 | OD1 | SA1 | REM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 99401899999 | 2022-01-01T00:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 290,1,N,0082,1 | 99999,9,9,N | 999999,9,9,9 | +0135,1 | +9999,9 | 10073,1 | 3,1,003,1,+999,9 | 5,99,0098,1,999 | +154,1 | SYN080LJAC1 46/// /2916 10135 40073 53003 9000... |
1 | 99401899999 | 2022-01-01T01:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 270,1,N,0062,1 | 99999,9,9,N | 999999,9,9,9 | +0134,1 | +9999,9 | 10074,1 | 2,1,008,1,+999,9 | 5,99,0082,1,999 | +153,1 | SYN080LJAC1 46/// /2712 10134 40074 52008 9010... |
2 | 99401899999 | 2022-01-01T02:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 280,1,N,0057,1 | 99999,9,9,N | 999999,9,9,9 | +0132,1 | +9999,9 | 10081,1 | 3,1,013,1,+999,9 | 5,99,0072,1,999 | +152,1 | SYN080LJAC1 46/// /2811 10132 40081 53013 9020... |
3 | 99401899999 | 2022-01-01T03:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 280,1,N,0046,1 | 99999,9,9,N | 999999,9,9,9 | +0131,1 | +9999,9 | 10087,1 | 3,1,014,1,+999,9 | 5,99,0067,1,999 | +152,1 | SYN080LJAC1 46/// /2809 10131 40087 53014 9030... |
4 | 99401899999 | 2022-01-01T04:00:00 | 4 | 32.867 | -117.258 | 2.0 | LA JOLLA, CA US | FM-13 | 99999 | V020 | 250,1,N,0057,1 | 99999,9,9,N | 999999,9,9,9 | +0133,1 | +9999,9 | 10094,1 | 2,1,020,1,+999,9 | 5,99,0062,1,999 | +153,1 | SYN080LJAC1 46/// /2511 10133 40094 52020 9040... |
weather_df['datetime'] = pd.to_datetime(weather_df['DATE'])
weather_df.drop(columns=['DATE'], inplace=True)
weather_df.drop(columns=['STATION','SOURCE', 'LATITUDE', 'LONGITUDE', 'ELEVATION','CIG','VIS','DEW',
                         'NAME', 'CALL_SIGN','QUALITY_CONTROL','REPORT_TYPE','SLP','MD1','OD1','REM'],
                inplace=True)
Drop rows where the Temperature column holds the N/A sentinel value (+9999,9).
weather_df = weather_df[weather_df['TMP'] != "+9999,9"]
Transform raw data to readable format.
weather_df['TMP'] = weather_df['TMP'].str.replace(',1', '').astype(float)
weather_df['TMP'] = weather_df['TMP'] / 10

weather_df['SA1'] = weather_df['SA1'].str.replace(',1', '').astype(float)
weather_df['SA1'] = weather_df['SA1'] / 10

weather_df = weather_df[['datetime', 'TMP', 'SA1', 'WND']]
weather_df = weather_df.rename(columns={'TMP': 'Dry_Bulb_Temp', 'SA1': 'Sea_Surface_Temp', 'WND': 'Wind'})

weather_df[['Wind_Degree', 'Wind_Speed']] = weather_df['Wind'].str.split(',', expand=True).iloc[:, [0, 3]]
weather_df['Wind_Degree'] = weather_df['Wind_Degree'].astype(float)
weather_df['Wind_Speed'] = weather_df['Wind_Speed'].astype(float)
weather_df['Wind_Speed'] /= 10
weather_df.drop(columns=['Wind'], inplace=True)
def degrees_to_cardinal(degrees):
    if degrees == 999:
        return np.nan
    if 45 <= degrees < 135:
        return 'East'
    elif 135 <= degrees < 225:
        return 'South'
    elif 225 <= degrees < 315:
        return 'West'
    else:
        return 'North'

weather_df['Wind_Direction'] = weather_df['Wind_Degree'].apply(degrees_to_cardinal)
Here’s the weather dataset after cleaning:
weather_df.head()
 | datetime | Dry_Bulb_Temp | Sea_Surface_Temp | Wind_Degree | Wind_Speed | Wind_Direction |
---|---|---|---|---|---|---|
0 | 2022-01-01 00:00:00 | 13.5 | 15.4 | 290.0 | 8.2 | West |
1 | 2022-01-01 01:00:00 | 13.4 | 15.3 | 270.0 | 6.2 | West |
2 | 2022-01-01 02:00:00 | 13.2 | 15.2 | 280.0 | 5.7 | West |
3 | 2022-01-01 03:00:00 | 13.1 | 15.2 | 280.0 | 4.6 | West |
4 | 2022-01-01 04:00:00 | 13.3 | 15.3 | 250.0 | 5.7 | West |
Sort the Geisel dataset by descending values of the In count column.
Geisel_Merged_df.sort_values(by='In', ascending=False).head()
 | datetime | In | Out | In_Out_Diff |
---|---|---|---|---|
3924 | 2023-05-10 10:00:00 | 79637 | 371206 | -291569 |
1693 | 2023-01-16 06:00:00 | 20175 | 13040 | 7135 |
1692 | 2023-01-16 04:00:00 | 14721 | 12762 | 1959 |
1691 | 2023-01-15 03:00:00 | 2629 | 14846 | -12217 |
168 | 2022-09-27 13:00:00 | 1510 | 1269 | 241 |
We dropped indices 3924, 1693, 1692, and 1691 because we identified these data points as outliers: it is highly unlikely that Geisel Library had over 10,000 people entering or leaving in a single hour. Geisel is pretty huge, but it's still not that big!
These data points also did not match the trends in the timestamps immediately before and after them. We concluded they were generated by a system failure and decided to drop them.
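As a sanity check, here is a minimal sketch (our own addition, not part of the original pipeline) of how such spikes could be flagged automatically with a loose IQR fence instead of eyeballing the sorted table; the fence multiplier is an arbitrary choice.

# Flag hours whose In or Out count sits far above a loose IQR fence
q1, q3 = Geisel_Merged_df['In'].quantile([0.25, 0.75])
upper_fence = q3 + 10 * (q3 - q1)          # multiplier of 10 chosen arbitrarily for extreme glitches only
suspect_rows = Geisel_Merged_df[(Geisel_Merged_df['In'] > upper_fence) | (Geisel_Merged_df['Out'] > upper_fence)]
print(suspect_rows)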
Geisel_Merged_df = Geisel_Merged_df.drop([3924, 1693, 1692, 1691])  # Drop outlier indices
We first examined the distributions of the In and Out counts and their difference, In_Out_Diff.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram for 'In'
sns.histplot(Geisel_Merged_df['In'], bins=30, ax=axes[0], kde=True)
axes[0].set_title('Distribution of In Counts', fontsize=20)

# Histogram for 'Out'
sns.histplot(Geisel_Merged_df['Out'], bins=30, ax=axes[1], kde=True)
axes[1].set_title('Distribution of Out Counts', fontsize=20)

# Histogram for 'In_Out_Diff'
sns.histplot(Geisel_Merged_df['In_Out_Diff'], bins=30, ax=axes[2], kde=True)
axes[2].set_title('Distribution of In-Out Difference', fontsize=20)

plt.tight_layout()
plt.show()
Here are the descriptive statistics:
descriptive_stats = Geisel_Merged_df[['In', 'Out', 'In_Out_Diff']].describe()
print(descriptive_stats)
In Out In_Out_Diff
count 4238.000000 4238.000000 4238.000000
mean 259.138745 283.027607 -23.888863
std 262.288989 275.191358 144.493022
min 0.000000 0.000000 -1072.000000
25% 29.250000 51.000000 -103.000000
50% 171.500000 188.000000 -13.000000
75% 425.000000 456.000000 69.000000
max 1510.000000 1433.000000 388.000000
Some preprocessing for later EDA.
def assign_quarter(x):
    quarters = {
        'Fall': {'start': '2022-09-19', 'end': '2022-12-10'},
        'Winter': {'start': '2023-01-04', 'end': '2023-03-15'},
        'Spring': {'start': '2023-03-29', 'end': '2023-06-16'}
    }
    x = pd.Timestamp(x)
    for quarter, dates in quarters.items():
        if pd.to_datetime(dates['start']) <= x <= pd.to_datetime(dates['end']):
            return quarter

Geisel_Merged_EDA_df = Geisel_Merged_df.copy()
Geisel_Merged_EDA_df['hour'] = Geisel_Merged_EDA_df['datetime'].dt.hour
Geisel_Merged_EDA_df['date'] = Geisel_Merged_EDA_df['datetime'].dt.date
Geisel_Merged_EDA_df['month'] = Geisel_Merged_EDA_df['datetime'].dt.month
Geisel_Merged_EDA_df['weekday'] = Geisel_Merged_EDA_df['datetime'].dt.weekday
Geisel_Merged_EDA_df['quarter'] = Geisel_Merged_EDA_df['datetime'].apply(lambda x: assign_quarter(x))
sns.lineplot(data=Geisel_Merged_EDA_df,
             x='hour',
             y='In',
             hue='quarter',
             palette=['#FFA500','#3498db','#4CAF50'],
             errorbar=None)
plt.title('Hourly Trend of IN', fontsize=16)
plt.xlabel('Hour')
plt.ylabel('IN count')
plt.show()
Geisel Library opens at 7:30 AM, and we can see an obvious increase around that time. However, since Geisel also has an overnight study area, the curves are quite smooth and we don't see any sharp jump.
sns.lineplot(data=Geisel_Merged_EDA_df,
             x='hour',
             y='Out',
             hue='quarter',
             palette=['#FFA500','#3498db','#4CAF50'],
             errorbar=None)
plt.title('Hourly Trend of OUT', fontsize=16)
plt.xlabel('Hour')
plt.ylabel('OUT count')
plt.show()
Geisel Library closes at 10:00 PM on most days, and we can see an obvious decrease around that time. However, since Geisel also has an overnight study area, the curves are quite smooth and we don't see any sharp drop.
First, we merged the hourly data into daily data for annual analysis.
Geisel_date_df = Geisel_Merged_EDA_df.groupby('date')[['In', 'Out', 'weekday']].median().reset_index()
Geisel_date_df['quarter'] = Geisel_date_df['date'].apply(lambda x: assign_quarter(x))

# Create the line plot with dots
plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_date_df,
             x='date',
             y='In',
             hue='quarter',
             markers=True,
             marker='o',
             markersize=6,
             linestyle='-',
             linewidth=2,
             palette=['#FFA500','#3498db','#4CAF50'])

# Customize plot details
plt.title('Daily In Trends of Geisel Library Across Quarters', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Counts')
plt.xticks(rotation=45)
plt.legend(title='Quarter', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()

# Show the plot
plt.show()
Since Geisel Library opens seven days a week except for some holidays, we do not need any further data cleaning regarding Geisel's library hours. From the graph, we can see some obvious peaks every quarter; these are probably the midterm weeks, when more students go to Geisel to study for exams. Among the three quarters, Fall quarter has the highest counts of people entering the library.
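To probe the midterm-week guess, here is a small sketch (our own addition) that lists the five busiest days per quarter from the daily medians computed above; the cutoff of five is arbitrary.

# List the five days with the highest median In count in each quarter
top_days = (Geisel_date_df
            .sort_values('In', ascending=False)
            .groupby('quarter')
            .head(5)
            .sort_values(['quarter', 'date']))
print(top_days[['quarter', 'date', 'In']])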
Geisel_fall_df = Geisel_Merged_EDA_df[Geisel_Merged_EDA_df['quarter'] == 'Fall'].copy()
Geisel_winter_df = Geisel_Merged_EDA_df[Geisel_Merged_EDA_df['quarter'] == 'Winter'].copy()
Geisel_spring_df = Geisel_Merged_EDA_df[Geisel_Merged_EDA_df['quarter'] == 'Spring'].copy()
Geisel_fall_df['date'] = Geisel_fall_df['datetime'].dt.date
Geisel_daily_fall_data = Geisel_fall_df.groupby('date').agg({'In': 'median', 'Out': 'median'}).reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_daily_fall_data, x='date', y='In', label='In', marker='o')
sns.lineplot(data=Geisel_daily_fall_data, x='date', y='Out', label='Out', marker='o')

week_range = pd.date_range('2022-09-19', '2022-12-10', freq='W-MON')

plt.title('Fall Quarter Daily In Trends of Geisel Library', fontsize=20)
plt.xlabel('Date (Monday as labels)')
plt.ylabel('Count')
plt.xticks(week_range, rotation=45)
plt.show()
At the beginning of the school year, there is a peak in counts, probably because new students all want to take a look at the famous Geisel Library. There is also an obvious dip in late November, which suggests only a few people enter Geisel Library during Thanksgiving break.
Geisel_winter_df['date'] = Geisel_winter_df['datetime'].dt.date
Geisel_daily_winter_data = Geisel_winter_df.groupby('date').agg({'In': 'median', 'Out': 'median'}).reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_daily_winter_data, x='date', y='In', label='In', marker='o')
sns.lineplot(data=Geisel_daily_winter_data, x='date', y='Out', label='Out', marker='o')

week_range = pd.date_range('2023-01-04', '2023-03-15', freq='W-MON')

plt.title('Winter Quarter Daily In Trends of Geisel Library', fontsize=20)
plt.xlabel('Date (Monday as labels)')
plt.ylabel('Count')
plt.xticks(week_range, rotation=45)
plt.show()
The data from Winter quarter is very periodic: many people entered Geisel Library in the middle of the week, while only a few entered on weekends.
Geisel_spring_df['date'] = Geisel_spring_df['datetime'].dt.date
Geisel_daily_spring_data = Geisel_spring_df.groupby('date').agg({'In': 'median', 'Out': 'median'}).reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=Geisel_daily_spring_data, x='date', y='In', label='In', marker='o')
sns.lineplot(data=Geisel_daily_spring_data, x='date', y='Out', label='Out', marker='o')

week_range = pd.date_range('2023-03-29', '2023-06-16', freq='W-MON')

plt.title('Spring Quarter Daily In Trends of Geisel Library', fontsize=20)
plt.xlabel('Date (Monday as labels)')
plt.ylabel('Count')
plt.xticks(week_range, rotation=45)
plt.show()
Around 2023-04-10, there is a sharp increase in the number of people leaving Geisel Library, but the number of people entering is not comparable, so we guess the counter was down on that day. Throughout the quarter, there is a slow increase in the student occupancy flow entering and leaving Geisel Library.
Sort the Wong Avery dataset by descending values of the In count column.
WongAvery_Merged_df.sort_values(by='In', ascending=False).head()
 | datetime | In | Out | In_Out_Diff |
---|---|---|---|---|
986 | 2023-02-28 10:00:00 | 28406 | 29227 | -821 |
1804 | 2023-05-24 09:00:00 | 6790 | 6323 | 467 |
1509 | 2023-04-17 17:00:00 | 204 | 29 | 175 |
988 | 2023-02-28 12:00:00 | 167 | 84 | 83 |
991 | 2023-02-28 15:00:00 | 165 | 130 | 35 |
We dropped indices 986 and 1804 because we identified these data points as outliers: it is highly unlikely that Wong Avery Library had over 5,000 people coming in within one hour. This is probably the result of machine failures.
WongAvery_Merged_df = WongAvery_Merged_df.drop([986, 1804])  # Drop outlier indices
We first examined the distributions of the In and Out counts and their difference, In_Out_Diff.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram for 'In'
sns.histplot(WongAvery_Merged_df['In'], bins=30, ax=axes[0], kde=True)
axes[0].set_title('Distribution of In Counts', fontsize=20)

# Histogram for 'Out'
sns.histplot(WongAvery_Merged_df['Out'], bins=30, ax=axes[1], kde=True)
axes[1].set_title('Distribution of Out Counts', fontsize=20)

# Histogram for 'In_Out_Diff'
sns.histplot(WongAvery_Merged_df['In_Out_Diff'], bins=30, ax=axes[2], kde=True)
axes[2].set_title('Distribution of In-Out Difference', fontsize=20)

plt.tight_layout()
plt.show()
Here are the descriptive statistics of the dataset:
descriptive_stats = WongAvery_Merged_df[['In', 'Out', 'In_Out_Diff']].describe()
print(descriptive_stats)
In Out In_Out_Diff
count 1803.000000 1803.000000 1803.000000
mean 55.387133 54.675541 0.711592
std 46.375260 45.069469 44.025394
min 0.000000 0.000000 -217.000000
25% 9.000000 8.000000 -16.000000
50% 47.000000 49.000000 2.000000
75% 97.000000 92.000000 26.000000
max 204.000000 254.000000 175.000000
We merged the hourly data into daily data for annual analysis.
WongAvery_Merged_EDA_df = WongAvery_Merged_df.copy()
WongAvery_Merged_EDA_df['date'] = WongAvery_Merged_EDA_df['datetime'].dt.date

date_df = WongAvery_Merged_EDA_df.groupby('date')[['In', 'Out']].median().reset_index()
date_df['quarter'] = date_df['date'].apply(lambda x: assign_quarter(x))
# Create the line plot with dots
plt.figure(figsize=(12, 6))
sns.lineplot(data=date_df,
             x='date',
             y='In',
             hue='quarter',
             markers=True,
             marker='o',
             markersize=6,
             linestyle='-',
             linewidth=2,
             palette=['#FFA500','#3498db','#4CAF50'])

# Customize plot details
plt.title('Daily In Trends of WongAvery Library Across Quarters', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Counts')
plt.xticks(rotation=45)
plt.legend(title='Quarter', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()

# Show the plot
plt.show()
Unfortunately, we observed a significant proportion of missing data during Winter quarter. We have to discard Wong Avery Library from further EDA and analysis 😢.
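To get a rough sense of how much is missing, here is a small sketch (our own addition) that counts, per quarter, the in-session days with no Wong Avery gate records at all.

# Count in-session days that have no Wong Avery records, per quarter
expected_days = pd.Series(
    list(pd.date_range('2022-09-19', '2022-12-10').date) +
    list(pd.date_range('2023-01-04', '2023-03-15').date) +
    list(pd.date_range('2023-03-29', '2023-06-16').date)
)
observed_days = set(WongAvery_Merged_EDA_df['date'])
missing_days = expected_days[~expected_days.isin(observed_days)]
print(missing_days.map(assign_quarter).value_counts())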
Preprocessing data.
weather_df['Quarter'] = weather_df['datetime'].apply(lambda x: assign_quarter(x))

start_date = pd.to_datetime('2022-09-19')
end_date = pd.to_datetime('2023-06-16')
weather_df = weather_df[(weather_df['datetime'] >= start_date) & (weather_df['datetime'] <= end_date)]
# Filter out the missing-value sentinels (999.9 speed, 999 degree) before plotting
wind_df = weather_df[weather_df['Wind_Speed'] <= 998]
wind_df = wind_df[wind_df['Wind_Degree'] <= 998]
Daily wind speed throughout the year:
daily_mean_wind_speed = wind_df.groupby(wind_df["datetime"].dt.date)["Wind_Speed"].mean()

plt.figure(figsize=(17, 7))
plt.plot(daily_mean_wind_speed)
plt.title('Daily Mean Wind Speed', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Mean Wind Speed')
plt.show()
Not much to tell from this graph. We guess it's just randomly windy in San Diego.
Quarterly wind speed:
quarterly_mean_wind_speed = weather_df.groupby('Quarter')['Wind_Speed'].mean()

plt.figure(figsize=(6, 6))

# Plot using Seaborn barplot
sns.barplot(x=quarterly_mean_wind_speed.index,
            hue=quarterly_mean_wind_speed.index,
            y=quarterly_mean_wind_speed,
            order=['Fall','Winter','Spring'],
            palette=['#FFA500', '#4CAF50', '#3498db'],
            width=0.5)

# Set titles and labels
plt.title('Mean Quarter Wind Speed', fontsize=20)
plt.xlabel('Quarter')
plt.ylabel('Mean Wind Speed')

# Rotate x-axis labels
plt.xticks(rotation=0)
plt.show()
It turns out that, overall, Spring quarter is the windiest on average!
Daily temperature trend
daily_mean_dry_bulb = weather_df.groupby(weather_df["datetime"].dt.date)["Dry_Bulb_Temp"].mean()
daily_mean_sea_surface = weather_df.groupby(weather_df["datetime"].dt.date)["Sea_Surface_Temp"].mean()

sns.set(style="whitegrid")

plt.figure(figsize=(17, 7))

# Plot using Seaborn lineplot
sns.lineplot(x=daily_mean_dry_bulb.index, y=daily_mean_dry_bulb, label='Dry Bulb Temp', color='#4CAF50')
sns.lineplot(x=daily_mean_sea_surface.index, y=daily_mean_sea_surface, label='Sea Surface Temp', color='#3498db')

plt.title('Daily Mean Dry Bulb Temperature and Sea Surface Temperature', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')

plt.legend()
plt.show()
As expected, the temperature decreases from the start of Fall quarter to the end of Winter quarter. It seems UCSD did decent research on how to split up the quarters. It is also quite interesting that although the sea surface temperature follows the same trend as the dry-bulb temperature, it is much smoother. This is reasonable, because the sea stores heat during the day, making it less vulnerable to sudden temperature changes.
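To back up the "smoother" observation with a number, here is a small sketch (our own addition): the mean absolute day-to-day change of each series, where a smaller value means a smoother curve.

# Mean absolute day-to-day change for each temperature series (smaller = smoother)
print(pd.DataFrame({
    'Dry_Bulb_Temp': daily_mean_dry_bulb.diff().abs(),
    'Sea_Surface_Temp': daily_mean_sea_surface.diff().abs(),
}).mean())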
Quarterly Mean Temperature
quarterly_mean_temp = weather_df.groupby('Quarter')[['Dry_Bulb_Temp','Sea_Surface_Temp']].mean()
quarterly_mean_temp = quarterly_mean_temp.reset_index().melt(id_vars='Quarter', var_name='Temperature', value_name='Value')

plt.figure(figsize=(8, 6))

sns.barplot(data=quarterly_mean_temp,
            x='Quarter',
            y='Value',
            hue='Temperature',
            palette=['#4CAF50', '#3498db'],
            order=['Fall','Winter','Spring'],
            width=0.8)

plt.title('Mean Temperatures by Quarter', fontsize=20)
plt.xlabel('Quarter')
plt.ylabel('Temperature (°C)')

plt.legend()
plt.show()
Merge the two datasets based on their timestamps.
weather_df['Wind_Degree'] = weather_df['Wind_Degree'].replace(999.0, np.nan)
weather_df['Wind_Speed'] = weather_df['Wind_Speed'].replace(999.9, np.nan)

merged_df = pd.merge(Geisel_Merged_df, weather_df, on='datetime', how='left')
Check NaN values:
merged_df[merged_df.isnull().any(axis=1)].head(8)
 | datetime | In | Out | In_Out_Diff | Dry_Bulb_Temp | Sea_Surface_Temp | Wind_Degree | Wind_Speed | Wind_Direction | Quarter |
---|---|---|---|---|---|---|---|---|---|---|
21 | 2022-09-20 12:00:00 | 340 | 334 | 6 | 18.9 | 20.4 | NaN | 0.0 | NaN | Fall |
23 | 2022-09-20 14:00:00 | 289 | 282 | 7 | 19.2 | 21.5 | NaN | 0.0 | NaN | Fall |
50 | 2022-09-22 11:00:00 | 830 | 689 | 141 | 18.5 | 20.6 | NaN | 0.0 | NaN | Fall |
52 | 2022-09-22 13:00:00 | 1027 | 982 | 45 | 17.6 | 19.4 | NaN | 0.0 | NaN | Fall |
63 | 2022-09-23 06:00:00 | 11 | 4 | 7 | 20.3 | 19.5 | NaN | 0.0 | NaN | Fall |
65 | 2022-09-23 08:00:00 | 307 | 105 | 202 | 20.1 | 19.0 | NaN | 0.0 | NaN | Fall |
69 | 2022-09-23 12:00:00 | 760 | 794 | -34 | 19.5 | 20.7 | NaN | 0.0 | NaN | Fall |
70 | 2022-09-23 13:00:00 | 837 | 820 | 17 | 19.5 | 19.7 | NaN | 0.0 | NaN | Fall |
There are missing values in the weather dataset, and most of them come from the wind-related columns. We will drop those columns as they are irrelevant to this study. We will also drop Sea_Surface_Temp, as it is not that relevant to UCSD students (we are not fish~). In_Out_Diff is also dropped.
merged_df.drop(columns=['Wind_Degree','Wind_Speed','Wind_Direction','In_Out_Diff','Sea_Surface_Temp'], inplace=True)
merged_df.rename(columns={'Dry_Bulb_Temp': 'Temperature'}, inplace=True)
Now the merged dataset looks like this:
merged_df.head()
 | datetime | In | Out | Temperature | Quarter |
---|---|---|---|---|---|
0 | 2022-09-19 06:00:00 | 2 | 3 | 18.7 | Fall |
1 | 2022-09-19 07:00:00 | 42 | 8 | 18.8 | Fall |
2 | 2022-09-19 08:00:00 | 110 | 22 | 18.1 | Fall |
3 | 2022-09-19 09:00:00 | 135 | 60 | 18.4 | Fall |
4 | 2022-09-19 10:00:00 | 180 | 116 | 18.3 | Fall |
Since we observed that Geisel is significantly less populated during weekends, we will drop weekend dates for better correlations. We also dropped nighttime data (11 PM to 8 AM): though Geisel has overnight study spaces, the top floors are all closed, and the sample population would be inconsistent if we kept the night data.
merged_df = merged_df[(merged_df['datetime'].dt.weekday < 5)]
merged_df = merged_df[(merged_df['datetime'].dt.hour >= 8) & (merged_df['datetime'].dt.hour <= 22)]
Let’s now try to see if we can observe any relationship between temperature and occupancy!
fig = pd.plotting.scatter_matrix(frame=merged_df[['In', 'Out', 'Temperature']],
                                 figsize=(10, 10), alpha=0.5)
plt.grid(False)
plt.suptitle('Scatter Matrix in Hourly Granularity', fontsize=20)
plt.show()
There is a clear linear relationship between In and Out. This makes sense: when a lot of people are coming into the library, there should also be many coming out. It is disappointing, though, to see such a mess in the Temperature column. Note that we are observing at an hourly granularity; if we switch to a daily granularity, things may change.
# Extract date without time
merged_df['date'] = merged_df['datetime'].dt.date

median_columns = ['In', 'Out', 'Temperature']
keep_columns = ['Quarter']

# Group by date and calculate the median for specified columns
daily_df = merged_df.groupby('date').agg({**{col: 'median' for col in median_columns}, **{col: 'first' for col in keep_columns}}).reset_index()
fig = pd.plotting.scatter_matrix(frame=daily_df[['In', 'Out', 'Temperature']],
                                 figsize=(10, 10), alpha=0.5)
plt.grid(False)
plt.suptitle('Scatter Matrix in Daily Granularity', fontsize=20)
plt.show()
Take a closer look at the daily in/out flow and temperature.
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.regplot(data=daily_df, x='In', y='Temperature', ax=axes[0], scatter_kws={'s': 100})
sns.regplot(data=daily_df, x='Out', y='Temperature', ax=axes[1], scatter_kws={'s': 100, 'color': sns.color_palette()[1]})

axes[0].set_title('IN Counts', fontsize=20)
axes[1].set_title('OUT Counts', fontsize=20)

plt.suptitle('Whole Year Relationship between Temperature and Occupancy', fontsize=24)
plt.show()
When combining all quarters, the relationship is somewhat weak, as many values fall outside the confidence interval. Maybe things will look better when we split the data up by quarter.
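To put a number on how weak or strong this is, a quick sketch (our own addition, with the resulting values unverified here) computes the Pearson correlations between the daily medians.

# Pearson correlation matrix of the daily medians
print(daily_df[['In', 'Out', 'Temperature']].corr())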
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Filter data for each quarter
fall_df = daily_df[daily_df['Quarter'] == 'Fall']
winter_df = daily_df[daily_df['Quarter'] == 'Winter']
spring_df = daily_df[daily_df['Quarter'] == 'Spring']

palette = ['#FFA500', '#4CAF50', '#3498db']
# Create regression plots for the first row
sns.regplot(data=fall_df, x='In', y='Temperature', ax=axes[0, 0], scatter_kws={'s': 100, 'color': '#FFA500'})
sns.regplot(data=winter_df, x='In', y='Temperature', ax=axes[0, 1], scatter_kws={'s': 100, 'color': '#3498db'})
sns.regplot(data=spring_df, x='In', y='Temperature', ax=axes[0, 2], scatter_kws={'s': 100, 'color': '#4CAF50'})

sns.regplot(data=fall_df, x='Out', y='Temperature', ax=axes[1, 0], scatter_kws={'s': 100, 'color': '#FFA500'})
sns.regplot(data=winter_df, x='Out', y='Temperature', ax=axes[1, 1], scatter_kws={'s': 100, 'color': '#3498db'})
sns.regplot(data=spring_df, x='Out', y='Temperature', ax=axes[1, 2], scatter_kws={'s': 100, 'color': '#4CAF50'})
# Set titles for the first row
plt.suptitle('Quarterly Relationship between Temperature and Occupancy', fontsize=24)
axes[0, 0].set_title('Fall', fontsize=20)
axes[0, 1].set_title('Winter', fontsize=20)
axes[0, 2].set_title('Spring', fontsize=20)

plt.show()
From the graph above, it seems there is a weak linear relationship between temperature and the In/Out counts during the Fall and Spring quarters. During Winter quarter, people just keep coming to Geisel regardless of temperature (Let's go Tritons!).
To some extent, this shows that as temperature rises, more people visit Geisel, which supports our hypothesis. But we have to admit that there are outliers in both the Fall and Spring quarters, not to mention Winter quarter.
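To quantify this quarter by quarter, here is a short sketch (again our own addition, with values unverified) that prints the per-quarter correlation between temperature and the daily In/Out counts.

# Per-quarter Pearson correlations between temperature and daily In/Out counts
for quarter, grp in daily_df.groupby('Quarter'):
    print(quarter,
          'In vs Temperature r =', round(grp['In'].corr(grp['Temperature']), 2),
          'Out vs Temperature r =', round(grp['Out'].corr(grp['Temperature']), 2))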
We conclude that weather may contribute to people's decisions about whether to go to the libraries, but many other factors may also play a role. It is also good to clear up the misconception that Geisel is especially crowded during finals week: it's always crowded!
Informed consent: We believe the data our team collected, both the weather data and the traffic accident data, come from a trusted public website, the official San Diego government website.
Collection bias: The weather and traffic accident data are all observed events, and the data collected are focused on facts that happened rather than subjective observation. This factual focus helps us reduce collection bias.
Limit PII exposure: Our data, though collected from various people and events, do not contain much personally identifiable information; there is no obvious part of our data that could lead to the identification of a specific individual.
Downstream bias mitigation: The data we are using were collected comprehensively across all traffic accidents regardless of race, gender, or religious background.
Data security: Since the data are publicly accessible, our job in keeping our own data safe is to make our repo private and only grant access to relevant individuals.
Right to be forgotten: We'll open a contact portal to the public so that anyone who wishes to have their data removed has the freedom to do so.
Data retention plan: We plan on deleting the data we've collected after week 10.
Missing perspectives: We plan on talking to the TA and possibly the professor about our assumptions to gather another point of view.
Dataset bias: The dataset is collected by the San Diego government through factual events, e.g. weather, temperature, and the number and type of traffic accidents that happened. In conclusion, we think our dataset is mostly bias free, but we cannot rule out bias created by the observer of an event, e.g. a police officer's view of an accident.
Honest representation: We will try our best to represent the data without any bias.
Privacy in analysis: We will ensure that data with PII are not used or displayed unless necessary for the analysis.
Auditability: We will ensure that any analysis we conduct is reasonable and reproducible.