In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite
, which you can download here. The database contains the following tables:
Table | Description |
---|---|
crime |
Crime reports dating back to 2010. |
mobile_food_locations |
List of all locations where mobile food vendors sell. |
mobile_food_permits |
List of all mobile food vendor permits. More details here. |
mobile_food_schedule |
Schedules for mobile food vendors. |
noise |
Noise complaints dating back to August 2015. |
parking |
List of all parking lots. |
parks |
List of all parks. |
schools |
List of all schools. |
zillow |
Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_
tables are explicitly connected through the locationid
and permit
columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
import pandas as pd
from matplotlib import pyplot as plt
import sqlalchemy as sqla
import sqlite3
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Point, MultiPoint
import shapefile
import zipcode
import folium
from folium import plugins
from sklearn import preprocessing
import numpy as np
from matplotlib.patches import Circle, Wedge, Polygon
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline
plt.style.use('ggplot')
sqlite_file = '/Users/natty/STA141B/hw6/sf_data.sqlite'
food_conn = sqla.create_engine('sqlite:///' + sqlite_file)
query = """select "Applicant",count("locationid") as count from mobile_food_permits
where "Applicant"!=" "
group by "Applicant"
order by count("locationid") desc"""
Applicant_count = pd.read_sql_query(query,food_conn)
Applicant_count.head(1)
Conclusion :So according to the head of the result, Roadside Rotisserie Corporation/Country Grill has the most locations, as its count is 7.
Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but these are examples of reasonable questions:
Please make sure to clearly state each of your questions in your submission.
* Which parts of the city are the most and least expensive?
query = """
select "RegionName",avg("MedianSoldPricePerSqft_AllHomes") as avg from zillow where "MedianSoldPricePerSqft_AllHomes"!=' ' group by "RegionName"
order by avg("MedianSoldPricePerSqft_AllHomes")
"""
sold = pd.read_sql_query(query,food_conn)
sold.tail(1)
We see that the area near the Embarcadero is the most expensive, perhaps for real estate, particularly for the many stores and restaurants there. This is perhaps due to the fact it is heavily visited by tourists, and is near the financial district, and near many nice restaurants, as well as newly created, upscale condos and apartments. This area is also a major transportation hub, as there is the BART station, as well as the ferries near the water. In addition, this district, in addition to the financial district, is close to other popular districts, such as Chinatown, North Beach. Many people, if they live in San Francisco, want to live near good transportation. In addition, many tech companies have moved closer downtown, so their workers would like to live close to work, and this has brought the price up of this district up, as historically the most expensive neighbourhood in San Francisco has always been Pacific Heights. However, because companies do not operate in that district, and the tech industry has become a more valuable asset to the city, the companies locations have increased in value over the past several years.
sold['lat'], sold['lon'] = [zipcode.isequal(str(x)).lat for x in sold.RegionName],[zipcode.isequal(str(x)).lon for x in sold.RegionName]
sold.head(1)
The least expensive part of San Francisco is the Bayview district. Due to the majority of the population of this district being Black, and because of the cities history of anti-Black procedures towards its Black residents, the least expensive part of the city would be occupied primarily by Black residents, because they have been economically pushed out of all of the other districts in the city. This can be seen with the Fillmore district, as well as the Ingleside heights district, both of which have been historically predominately Black neighbourhoods, but have been pushed out by the city through gentrification practiises. In addition, the Bayview district is very isolated from the other parts of the city, and lacks major transportation options, as well as restaurants, and many grocery stores, as it is mostly warehouses, and is close to several freeways. As how Oakland has been changing in that its historic Black population has been forced out due to increasing prices to other areas, this will likely affect this district in being that it will no longer be the least expensive district in the city, as more non-Black people move into the area and cause the values of the properties to increase.
*Which parts of the city are the most dangerous (and at what times)?
query = """select avg("Lon") as 'lon', avg("Lat") as 'lat',"PdDistrict",count(*) as 'count' from crime where "PdDistrict"!=' ' group by "PdDistrict" order by count(*) """
crime = pd.read_sql_query(query,food_conn)
crime.tail(1)
query = """select "Category","Lon","Lat" from crime"""
crime_map = pd.read_sql_query(query,food_conn)
x=list(crime_map.Lon)
y=list(crime_map.Lat)
fig, ax1 = plt.subplots(figsize=(10, 8))
my_map = Basemap(resolution='i',projection='merc',llcrnrlat=37.7026, urcrnrlat=37.8236, llcrnrlon=-122.5430, urcrnrlon=-122.3503)
x1,y1=my_map(x,y)
my_map.scatter(x1,y1,s=1,c='r',marker="o",cmap=cm.cool,alpha=0.8)
my_map.readshapefile("geo_export_44af5c14-2ca2-4c06-84fd-c8c77dbdcee6", 'sanfran')
x2,y2=my_map(list(crime.lon),list(crime.lat))
my_map.scatter(x2,y2,s=100,c=np.array(crime['count']),marker="o",cmap='viridis',alpha=1)
my_map.colorbar()
plt.show()
query = """select "PdDistrict", count(*) as count from crime where "PdDistrict" != '' group by "PdDistrict" order by count desc """
pd.read_sql_query(query,food_conn).head()
The southern district appears to be the most dangerous out of all the districts, and is particularly dangerous early in the morning, or late at night. The southern station is located downtown, near the SoMA district, (South of Market). This area has many nightclubs, which means many intoxicated people early in the morning. In addition to this, there is an increasing homeless population (quite large in this area),who live in this area, who are subject to a lot of police harassment, as well as crime towards them as well by other people in the communities. In addition to the SoMA district, the southern police district also covers the Embarcadero district, as well as parts of Chinatown. In other words, the police district covers a relatively large section of downtown San Francisco, which is home to many bars, and as mentioned nightclubs that close late at night. While historically, the tenderloin was perceieved to be the most dangerous area, as that area has been changed by gentrification, more or likely these areas have because the new home to previous offenders.
query = """select "DayOfWeek", count(*) as count from crime where "PdDistrict"=="SOUTHERN" group by "DayOfWeek" order by count desc """
pd.read_sql_query(query,food_conn)
Friday appears to be the day of the week with most crime activities, likely because it is the last day of the work week. In regards to that notion, more than likely crime is occurring on this day because people are going out on this day, and there are more people out on the streets, thus making crimes of opportunities more likely to happen. This means that with more people, there is a greater chance of: pick pockets, items such as purses being taken, as well as noise complaints. In addition, Friday also means that other people outside of the city will come and visit the city for eating or for doing other activities, such as going to bars or visiting clubs. Due to not knowing what the highest form of crime that is happening is, we can only suspect so much as assume that it is due to the increased population of people, likely in the downtown area.
query ="""select "Lat" as lat,"Lon" as lon from noise where lat != '' and lon != '' """
noise_loc = pd.read_sql_query(query,food_conn)
query ="""select "Latitude" as lat, "Longitude" as lon from mobile_food_locations"""
food_loc = pd.read_sql_query(query,food_conn)
fig, ax1 = plt.subplots(figsize=(12, 10))
cmap = Basemap(llcrnrlat=37.7026, urcrnrlat=37.8236, llcrnrlon=-122.5430, urcrnrlon=-122.3503,
resolution='i',projection='merc')
cmap.readshapefile("geo_export_44af5c14-2ca2-4c06-84fd-c8c77dbdcee6", 'comarques')
x1,y1 = cmap(list(noise_loc.lon),list(noise_loc.lat))
x2,y2 = cmap(list(food_loc.lon),list(food_loc.lat))
cmap.scatter(x1,y1,s=10,c='r',marker="o",cmap=cm.jet,alpha=0.8,)
cmap.scatter(x2,y2,s=10,c='b',marker="o",cmap=cm.jet,alpha=0.8,)
print np.mean(food_loc.lat), np.mean(noise_loc.lat)
print np.mean(food_loc.lon),np.mean(noise_loc.lon)
As we see that food vendors are labeled in blue, and nosie complaints are labelled in red, we see that the majority of fod vendors are located towards the eastern part of the city, close to downtown where the majority of people who work downtown can easily venture off and find a food truck. The amount of noise complaints are scattered completely across the city, and the majority of them are being reported no where near any food trucks. For instance, there are many noise complaints happening in residential area such as the Sunset district, the Richmond district, as well as the Ingleside district. While there are some stores in these areas, there are no food trucks in these areas because they are mostly residential areas, and the food truck companies locations appear to be influenced by the highest population of workers, who will likely go out for lunch and buy from the food trucks. It is unlikely that if a food truck was operating in a residential area such as the outer sunset, that they would get a lot of customers, and would likely not be allowed to be there because of the fact it’s a residential area.
query = """
select "Address" as address,"DayOfWeek" as weekday, "EndHour" as end, "StartHour" as start, count(*) as count from mobile_food_schedule
join mobile_food_locations on mobile_food_schedule.locationid = mobile_food_locations.locationid
join mobile_food_permits on mobile_food_schedule.permit = mobile_food_permits.permit
where "FacilityType" == 'Truck'
group by Address, weekday
order by count desc
"""
truck = pd.read_sql_query(query,food_conn)
truck.head(10)
We see the best time and place is at 400 Howard ST. Monday to Friday noon to night. We see that the highest visited food truck locations occur downtown, and happen during noon into the night. This is likely because starting at noon, many of the people who work downtown venture out for lunch and will likely visit food trucks, because of their location to their offices, as well as the fact that the food is made relatively faster as opposed if they would visit a sit down restaurant for lunch. In addition to this, because of its location, and due to the heavy foot traffic of the downtown area, these food trucks are easily accessible to anyone who is going through this area after they get off any major transportation method in the nearby area. For example, someone getting off from work, coming off from the BART station can easily access these food trucks, as they are located close to downtown. In addition, this area is easily accessible to not just workers, but anyone shopping during the day, which includes tourists.
fig, ax1 = plt.subplots(figsize=(12, 10))
dmap = Basemap(llcrnrlat=37.7026, urcrnrlat=37.8236, llcrnrlon=-122.5430, urcrnrlon=-122.3503,
resolution='i',projection='merc')
dmap.readshapefile("geo_export_44af5c14-2ca2-4c06-84fd-c8c77dbdcee6", 'comarques')
x1,y1=bmap(list(sold.lon),list(sold.lat))
x2,y2 = cmap(list(noise_loc.lon),list(noise_loc.lat))
x3,y3 = cmap(list(food_loc.lon),list(food_loc.lat))
dmap.scatter(x2,y2,s=10,c='r',marker="o",cmap=cm.jet,alpha=0.4,)
dmap.scatter(x3,y3,s=10,c='b',marker="o",cmap=cm.jet,alpha=0.4,)
dmap.scatter(x1,y1,s=100,c=np.array(sold.avg),marker="o",cmap='viridis',alpha=1,)
dmap.colorbar()
We see that the most expensive area of the city, which is the embarcadero district, has a lot of noise complaints as well as crime activities. This is due to the fact this area is home to many restaurants, clubs, as well as major transportation hubs such as the Embarcadero BART station, and the nearby ferries . In other words, this area brings a lot of different people together, and as stated earlier, wherever there is a lot of people, the greater the chances of criminal activity to occur, because it allows for greater opportunities. In other words, we can presume that noise and crime are correlated with one another, if we take into consideration that the noise arises from increases in population, which therefore makes crimes more likely to happen because there are more people around who can be victims of crimes. Once again this can be determined more closely, if we knew what the highest crime activity that was occurring was, which from that we could justify that it is due to more people being in the area. This also supports the fact that the day of the week with the most reported crimes, is Friday, which is known to increase more people to areas where there is more things to do such as go out and eat, and attend events etc.