Project #6: Data Science Web App with Streamlit and Python
This project was completed through Coursera Rhyme, a cloud workspace platform used for Guided Projects from the Coursera Project Network.
https://www.coursera.org/projects/data-science-streamlit-python
About the Instructor: Snehan Kekre is a Developer Advocate at Snowflake, where he specializes in the Streamlit open-source Python library. He previously worked as a Developer Advocate at Streamlit (pre-acquisition), has authored and taught more than 40 guided projects on machine learning and data science at Coursera, and has also worked as a skills consultant at Coursera and as a content strategist at Rhyme.com.
Skills Learned / Improved: Web Development, Data Science Visualisation, Streamlit Library, Python Programming
Project Highlights & Modules
Work with the New York City Motor Vehicle Collisions dataset.
Build interactive web applications with Streamlit and Python.
Use Pandas for data manipulation in data science workflows.
Load, explore, visualize, and interact with data, and generate dashboards in less than 100 lines of Python code (a minimal sketch of this pattern follows the list below).
Use 3D map layers and histograms to answer questions such as which streets see the most collisions and at what time of day most accidents occur.
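As a rough illustration of that load-filter-map pattern, here is a minimal sketch. This is not the project code (which appears in full under "My Final Code" below); the CSV path and column names are placeholders chosen for the example.

import streamlit as st
import pandas as pd

st.title("Minimal Collision Dashboard Sketch")

@st.cache(persist=True)  # cache the loaded CSV so the app does not reload it on every rerun
def load_data():
    # "collisions.csv" is a placeholder path, assumed to contain latitude, longitude and injured_persons columns
    df = pd.read_csv("collisions.csv")
    return df.dropna(subset=["latitude", "longitude"])

data = load_data()
threshold = st.slider("Minimum number of people injured", 0, 19)
st.map(data[data["injured_persons"] >= threshold][["latitude", "longitude"]])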
Date of Project Completion: 5th July 2023
Project Images
Code & Project Resources
Dataset Provided: https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes
Library Requirements: numpy==1.16.4, pandas==0.24.2, pydeck==0.3.0, streamlit==0.57.3, plotly==4.0.0
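To reproduce the setup outside the Rhyme workspace, the pinned versions can be installed with pip and the app launched with the Streamlit CLI (the file name app.py is my own assumption, not part of the project):

pip install numpy==1.16.4 pandas==0.24.2 pydeck==0.3.0 streamlit==0.57.3 plotly==4.0.0
streamlit run app.py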
My Final Code:
import streamlit as st
import pandas as pd
import numpy as np
import pydeck as pdk
import plotly.express as px
st.title("Data Science Web #1")
st.markdown("## Motor Vehicle Collisions in New York City")
st.markdown("#### Made by Arjun Raghunandanan following the instructor using Python & Streamlit for Coursera Project Network")
# Modify this path/URL according to where you have stored the CSV file or where you are fetching it from
# Option 1: fetch the CSV directly online
#DATA_URL = "https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD"
# Option 2: download the CSV file and load it locally (I used this method)
DATA_URL = ("/home/rhyme/Desktop/Project/Motor_Vehicle_Collisions_-_Crashes.csv")
# Function to load the data
@st.cache(persist=True) # Cache the data to avoid reloading on each run
def load_data(nrows):
    # Parse CRASH_DATE and CRASH_TIME into a single datetime column while reading
    data = pd.read_csv(DATA_URL, nrows=nrows, parse_dates=[['CRASH_DATE', 'CRASH_TIME']])
    # Drop rows without coordinates so every record can be plotted on a map
    data.dropna(subset=['LATITUDE', 'LONGITUDE'], inplace=True)
    # Normalise column names to lowercase
    lowercase = lambda x: str(x).lower()
    data.rename(lowercase, axis='columns', inplace=True)
    data.rename(columns={'crash_date_crash_time': 'date/time'}, inplace=True)
    return data
# Load the data
data = load_data(100000)
original_data = data.copy() # Create a copy of the original data for later use
# Display header and slider for selecting the number of injured people
st.header("Where are the most people injured in NYC?")
injured_people = st.slider("Number of People Injured in Vehicle Collisions", 0, 19)
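# Plot collision locations where at least the selected number of people were injured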
st.map(data.query("injured_persons >= @injured_people")[["latitude", "longitude"]].dropna(how="any"))
# Display header and slider for selecting the hour
st.header("How many collisions occur during a given time of day?")
hour = st.slider("Hour to look at", 0, 23)
filtered_data = data[data['date/time'].dt.hour == hour]
st.markdown("Vehicle Collisions between %i:00 and %i:00" % (hour, (hour + 1) % 24))
midpoint = np.average(filtered_data['latitude']), np.average(filtered_data['longitude'])
st.pydeck_chart(pdk.Deck(
    map_style="mapbox://styles/mapbox/light-v9",
    initial_view_state={
        "latitude": midpoint[0],
        "longitude": midpoint[1],
        "zoom": 11,
        "pitch": 50,
    },
    layers=[
        pdk.Layer(
            "HexagonLayer",
            data=filtered_data[['date/time', 'latitude', 'longitude']],
            get_position=['longitude', 'latitude'],
            radius=100,
            elevation_scale=4,
            elevation_range=[0, 1000],
            pickable=True,
            extruded=True,
        ),
    ],
))
# Display subheader and histogram chart for breakdown by minute
st.subheader("Breakdown by minute between %i:00 and %i:00" % (hour, (hour + 1) % 24))
filtered = data[
    (data['date/time'].dt.hour >= hour) & (data['date/time'].dt.hour < (hour + 1))
]
hist = np.histogram(filtered['date/time'].dt.minute, bins=60, range=(0, 60))[0]
chart_data = pd.DataFrame({'minute': range(60), 'crashes': hist})
fig = px.bar(chart_data, x='minute', y='crashes', hover_data=['minute', 'crashes'], height=400)
st.plotly_chart(fig)
# Display header and selectbox for selecting the affected type
st.header("Top 5 Dangerous Streets by Affected Type")
select = st.selectbox('Affected Type of People', ['Pedestrians', 'Cyclists', 'Motorists'])
if select == 'Pedestrians':
    st.write(original_data.query("injured_pedestrians >= 1")[["on_street_name", "injured_pedestrians"]]
             .sort_values(by=['injured_pedestrians'], ascending=False).dropna(how='any')[:5])
elif select == 'Cyclists':
    st.write(original_data.query("injured_cyclists >= 1")[["on_street_name", "injured_cyclists"]]
             .sort_values(by=['injured_cyclists'], ascending=False).dropna(how='any')[:5])
else:
    st.write(original_data.query("injured_motorists >= 1")[["on_street_name", "injured_motorists"]]
             .sort_values(by=['injured_motorists'], ascending=False).dropna(how='any')[:5])
# Checkbox to display raw data
if st.checkbox("Show Raw Data", False):
st.subheader('Raw Data')
st.write(data)