database A
- Use the existing dataset for the project and make it larger
- What I used:
- https://apps.bea.gov/itable/?ReqID=70&step=1&_gl=1*62f72f*_ga*MTgyMjczNzI0MS4xNzYwNTQyNDQ2*_ga_J4698JNNFT*czE3NjA5NDA4NDMkbzIkZzAkdDE3NjA5NDA4NDMkajYwJGwwJGgw#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMV0sImRhdGEiOltbIlRhYmxlSWQiLCI2MDAiXSxbIk1ham9yX0FyZWEiLCIwIl1dfQ==
- I used this for GDP data (real GDP, GDP); it has data from 1998-2024 for each state
- There are other possibly useful features, like real personal income, personal income, disposable personal income, per capita personal income, total employment, per capita personal consumption expenditures
- info
- https://www.kff.org/state-health-policy-data/state-indicator/firearms-death-rate-per-100000/?currentTimeframe=0&sortModel={"colId":"Location","sort":"asc"}
- This has the death by firearm rate (per 100,000) for each US state for the years 1999-2024 (one feature)
- Options:
- https://datacenter.aecf.org/locations?gad_source=1&gad_campaignid=22678288986&gbraid=0AAAAAD3xzvElCTVh72WaN4wUoDuD180jK&gclid=CjwKCAiAncvMBhBEEiwA9GU_fjLl4P8s-5dzQVbQKIGkAUzHkG6EYHW6lpWF8sNKa76oRAeGY8qAABoC6D0QAvD_BwE
- This website has a bunch of good socioeconomic, demographic, and education-related feature options, but the data only covers 2016-2023 (this may not be enough years)
- https://nces.ed.gov/programs/digest/d21/tables/dt21_104.10.asp
- This has the education data from 1999-2021, woohoo
- https://data.ers.usda.gov/reports.aspx?ID=4026#P99cc1147c60840b2bd5d8bb81ed09698_3_241iT2
- This one has data per state, but years are merged
- https://nces.ed.gov/programs/digest/d21/tables/dt21_104.80.asp
- This has it per state, but only for 2012 and 2022
- https://nces.ed.gov/programs/digest/d18/tables/dt18_104.80.asp
- https://fred.stlouisfed.org/release/tables?rid=330&eid=391444&od=2007-01-01#
- This has bachelor’s degree attainment estimates from 2006-2024 per US state
- https://data.census.gov/table/ACSST5Y2010.S1501?q=education&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56
- This one might be the best; it has data from 2010-2024 per US state on educational attainment (38 features)
- https://data.census.gov/table?q=poverty&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56
- This has poverty data from 2010-2024 (62 features)
- https://data.census.gov/table?q=language&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56
- Language data from 2010-2024 (24 features)
- https://data.census.gov/table?q=income&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56 (15 features)
- In total, this could give us around 650 instances (roughly 50 states × 13 years) with a lot more than 11 features

database B
- Using the kaggle database
- Comprehensive record of over 260k US gun violence incidents from 2013-2018
- The person compiled the data in the wake of the Parkland shooting, since the perpetrator had exhibited many warning signs on social media
- What if we could build a ML system that preemptively detected signs?
- I think this could be good for feature engineering, since we could identify the features most strongly associated with these specific incidents
- The dataset has the location of each incident and characteristics of each incident (like the type of gun used, ages of perpetrators)
- This person did EDA on the dataset: https://www.kaggle.com/code/shivamb/deep-exploration-of-gun-violence-in-us
- I previously looked at this project in Introduction to Data Science
- This person also used the same kaggle dataset and did more EDA: https://www.kaggle.com/code/erikbruin/gun-violence-in-the-us-eda-and-rshiny-app/report
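One simple way to start the feature-engineering idea from database B is to rank candidate features by their correlation with an incident outcome. This is just a sketch: the column names below are made up and the real Kaggle dataset's schema differs.

```python
import pandas as pd

# Hypothetical incident-level features (the real Kaggle columns differ);
# rank features by absolute correlation with an outcome as a first pass.
df = pd.DataFrame({
    "n_guns_involved": [1, 2, 1, 4, 1, 3],
    "perp_age": [24, 31, 19, 45, 22, 38],
    "n_injured": [0, 3, 1, 6, 0, 4],
})

# Correlation of every feature with the outcome, strongest first
corr = df.corr()["n_injured"].drop("n_injured").abs().sort_values(ascending=False)
print(corr)
```

A proper version would use something like a tree-based feature importance instead of raw correlation, but this gives a quick first cut.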
database C
- https://datahub.thetrace.org/data-library/?dir=desc&sort=date_updated&pg=1
- This database has a bunch of different datasets, a lot of them recent and more focused on the actual gun violence incidents
- Pros: sooo much data
- The Quarterly Gun Deaths and Injuries by City dataset has a huge number of instances
- Other good ones:
- Mass Shootings (this comes from the gun violence archive, basically the same as the Kaggle repo) (17 features)
- Firearm Sales → seems to have per-state data from 2000-2025 on the amount of certain types of guns purchased (6 features)
- ATF Gun Dealer Thefts & Losses → data from 2013-2024 per state of guns stolen from dealers, along with firearms that went missing or were lost, sourced from the Bureau of Alcohol, Tobacco, Firearms and Explosives (10 features)
- Firearm Production → has data per year of the number of each type of firearm produced (probably not as useful since it’s not per state) (36 features, but not by state)
- CDC Gun Deaths → 💵 money database; has gun deaths per state for 1999-2023, with a bunch of different breakdowns of the gun violence rate (like gun type, urbanization, etc.) (12 features including gun fatalities)
- Idea: combine a lot of these with some sort of gun death metric per state, and then make classes for gun death rate
- Maybe compare the socioeconomic-data model with the gun-related model using feature engineering??
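The "make classes for gun death rate" idea could be sketched with `pd.qcut`. The rates below are made up for illustration, not actual CDC numbers.

```python
import pandas as pd

# Made-up per-state gun death rates (per 100k); the real values would
# come from the CDC Gun Deaths dataset.
df = pd.DataFrame({
    "state": ["AL", "CA", "MA", "MS", "NY", "TX"],
    "death_rate": [23.6, 8.5, 3.4, 28.6, 5.3, 14.2],
})

# Quantile bins (tertiles) -> three balanced classes: low / medium / high
df["rate_class"] = pd.qcut(df["death_rate"], q=3, labels=["low", "medium", "high"])
print(df)
```

`qcut` splits on quantiles so the classes stay balanced; `pd.cut` would split on fixed rate thresholds instead, which might be more interpretable.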
dataset notes
how could I combine them…
- Change all the state names to abbreviations
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
>>> df.rename(index={0: "x", 1: "y", 2: "z"})
   A  B
x  1  4
y  2  5
z  3  6
value_mapping = {
    'old_value1': 'new_value1',
    'old_value2': 'new_value2',
    'old_value3': 'new_value3',
}
df['column_name'] = df['column_name'].replace(value_mapping)
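Applying the replace pattern above to the state-abbreviation step might look like this. Only a handful of states are shown; the full mapping would need all 50 states plus DC.

```python
import pandas as pd

# Partial mapping of state names -> USPS abbreviations (illustration only;
# the real mapping needs all 50 states + DC)
state_abbrev = {
    "Alabama": "AL",
    "California": "CA",
    "New York": "NY",
    "Texas": "TX",
}

df = pd.DataFrame({"state": ["California", "Texas", "New York"]})
df["state"] = df["state"].replace(state_abbrev)
print(df["state"].tolist())  # ['CA', 'TX', 'NY']
```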
- Make each row map to a combined state + year key
import pandas as pd
df['StateYear'] = df['state'].astype(str) + df['years'].astype(str)
print(df)
- Make the different sections of data their own column
- This is probably what I need to do
df_dummies = pd.get_dummies(df, columns=['your_category_column_name'], prefix=['category'])
- Combine everything based on their new location + year index
- Maybe do this for each category first and then combine all of those
- Also maybe I can write a function that automates some of this
merged_df = pd.merge(df1, df2, on='common_col')
df = df.drop('ColumnName', axis=1)
df = df.drop(columns=['ColumnName'])
columns_to_drop = ['ColumnA', 'ColumnB', 'ColumnC']
df = df.drop(columns=columns_to_drop)
df.to_csv('output_file.csv')
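A function automating the steps above might look like this. It's only a sketch: `standardize`, `combine`, and all the column names and values are made up for illustration, and each real dataset would need its own column arguments.

```python
from functools import reduce

import pandas as pd

def standardize(df, state_col, year_col, abbrev_map=None):
    """Abbreviate state names (if a mapping is given) and build a
    state+year key so datasets can be merged on a common index."""
    out = df.copy()
    if abbrev_map:
        out[state_col] = out[state_col].replace(abbrev_map)
    out["StateYear"] = out[state_col].astype(str) + out[year_col].astype(str)
    return out.drop(columns=[state_col, year_col])

def combine(dfs):
    """Merge a list of standardized DataFrames on the shared key."""
    return reduce(lambda a, b: pd.merge(a, b, on="StateYear"), dfs)

# Toy stand-ins for two of the real datasets
gdp = standardize(
    pd.DataFrame({"state": ["CA", "TX"], "year": [2020, 2020], "gdp": [3.0, 1.9]}),
    "state", "year")
deaths = standardize(
    pd.DataFrame({"state": ["CA", "TX"], "year": [2020, 2020], "rate": [8.5, 14.2]}),
    "state", "year")

merged = combine([gdp, deaths])
print(merged)
```

Doing the per-category merges first (as noted above) and then calling `combine` on the results would fit this same pattern.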