database A
- Use the existing dataset for the project and make it larger
- What I used:
- https://apps.bea.gov/itable/?ReqID=70&step=1&_gl=1*62f72f*_ga*MTgyMjczNzI0MS4xNzYwNTQyNDQ2*_ga_J4698JNNFT*czE3NjA5NDA4NDMkbzIkZzAkdDE3NjA5NDA4NDMkajYwJGwwJGgw#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMV0sImRhdGEiOltbIlRhYmxlSWQiLCI2MDAiXSxbIk1ham9yX0FyZWEiLCIwIl1dfQ==
- I used this for GDP data (real GDP, GDP); it has data from 1998-2024 for each state
- There are other possibly useful features, like real personal income, personal income, disposable personal income, per capita personal income, total employment, per capita personal consumption expenditures
- info
- https://www.kff.org/state-health-policy-data/state-indicator/firearms-death-rate-per-100000/?currentTimeframe=0&sortModel={"colId":"Location","sort":"asc"}
- This has the death by firearm rate (per 100,000) for each US state for the years 1999-2024 (one feature)
- Options:
- https://datacenter.aecf.org/locations?gad_source=1&gad_campaignid=22678288986&gbraid=0AAAAAD3xzvElCTVh72WaN4wUoDuD180jK&gclid=CjwKCAiAncvMBhBEEiwA9GU_fjLl4P8s-5dzQVbQKIGkAUzHkG6EYHW6lpWF8sNKa76oRAeGY8qAABoC6D0QAvD_BwE
- This website has a bunch of good socioeconomic, demographic, and education-related feature options, but the data only covers 2016-2023 (this may not be enough years)
- https://nces.ed.gov/programs/digest/d21/tables/dt21_104.10.asp
- This has the education data from 1999-2021, woohoo
- https://data.ers.usda.gov/reports.aspx?ID=4026#P99cc1147c60840b2bd5d8bb81ed09698_3_241iT2
- This one has data per state, but years are merged
- https://nces.ed.gov/programs/digest/d21/tables/dt21_104.80.asp
- This has it per state, but only for 2012 and 2022
- https://nces.ed.gov/programs/digest/d18/tables/dt18_104.80.asp
- https://fred.stlouisfed.org/release/tables?rid=330&eid=391444&od=2007-01-01#
- This has bachelor’s degree attainment estimates from 2006-2024 per US state
- https://data.census.gov/table/ACSST5Y2010.S1501?q=education&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56
- This one might be the best; it has data from 2010-2024 per US state on educational attainment (38 features)
- https://data.census.gov/table?q=poverty&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56
- This has poverty data from 2010-2024 (62 features)
- https://data.census.gov/table?q=language&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56
- Language data from 2010-2024 (24 features)
- https://data.census.gov/table?q=income&g=040XX00US01,02,04,05,06,08,09,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56 (15 features)
- In total, this could give us around 650 instances (roughly 50 states × 13 years) with a lot more than 11 features

database B
- Using the kaggle database
- Comprehensive record of over 260k US gun violence incidents from 2013-2018
- The person compiled the data in the wake of the Parkland shooting, since the perpetrator had exhibited many warning signs on social media
- What if we could build a ML system that preemptively detected signs?
- I think this could be good for feature engineering, since we could identify the features most strongly associated with these specific incidents
- The dataset has the location of each incident and characteristics of each incident (like the type of gun used, ages of perpetrators)
- This person did EDA on the dataset: https://www.kaggle.com/code/shivamb/deep-exploration-of-gun-violence-in-us
- I previously looked at this project in Introduction to Data Science
- This person also used the same kaggle dataset and did more EDA: https://www.kaggle.com/code/erikbruin/gun-violence-in-the-us-eda-and-rshiny-app/report
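One simple way to start the feature-engineering idea from database B is to rank candidate features by their correlation with an incident outcome. This is just a sketch: the column names below are made up and the real Kaggle dataset's schema differs.

```python
import pandas as pd

# Hypothetical incident-level features (the real Kaggle columns differ);
# rank features by absolute correlation with an outcome as a first pass.
df = pd.DataFrame({
    "n_guns_involved": [1, 2, 1, 4, 1, 3],
    "perp_age": [24, 31, 19, 45, 22, 38],
    "n_injured": [0, 3, 1, 6, 0, 4],
})

# Correlation of every feature with the outcome, strongest first
corr = df.corr()["n_injured"].drop("n_injured").abs().sort_values(ascending=False)
print(corr)
```

A proper version would use something like a tree-based feature importance instead of raw correlation, but this gives a quick first cut.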
database C
- https://datahub.thetrace.org/data-library/?dir=desc&sort=date_updated&pg=1
- This database has a bunch of different datasets, a lot of them recent and more focused on the actual gun violence incidents
- Pros: sooo much data
- The Quarterly Gun Deaths and Injuries by City dataset has a huge number of instances
- Other good ones:
- Mass Shootings (this comes from the gun violence archive, basically the same as the Kaggle repo) (17 features)
- Firearm Sales → seems to have per-state data from 2000-2025 on the amount of certain types of guns purchased (6 features)
- ATF Gun Dealer Thefts & Losses → data from 2013-2024 per state of guns stolen from dealers, along with firearms that went missing or were lost, sourced from the Bureau of Alcohol, Tobacco, Firearms and Explosives (10 features)
- Firearm Production → has data per year of the number of each type of firearm produced (probably not as useful since it’s not per state) (36 features, but not by state)
- CDC Gun Deaths → 💵 money database; has gun deaths per state for 1999-2023, with a bunch of different breakdowns of the gun violence rate (like gun type, urbanization, etc.) (12 features including gun fatalities)
- Idea: combine a lot of these with some sort of gun death metric per state, and then make classes for gun death rate
- Maybe compare the socioeconomic-data model with the gun-related model using feature engineering??
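The "make classes for gun death rate" idea could be sketched with `pd.qcut`. The rates below are made up for illustration, not actual CDC numbers.

```python
import pandas as pd

# Made-up per-state gun death rates (per 100k); the real values would
# come from the CDC Gun Deaths dataset.
df = pd.DataFrame({
    "state": ["AL", "CA", "MA", "MS", "NY", "TX"],
    "death_rate": [23.6, 8.5, 3.4, 28.6, 5.3, 14.2],
})

# Quantile bins (tertiles) -> three balanced classes: low / medium / high
df["rate_class"] = pd.qcut(df["death_rate"], q=3, labels=["low", "medium", "high"])
print(df)
```

`qcut` splits on quantiles so the classes stay balanced; `pd.cut` would split on fixed rate thresholds instead, which might be more interpretable.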
dataset notes
how could I combine them…
- Change all the state names to abbreviations
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
>>> df.rename(index={0: "x", 1: "y", 2: "z"})
   A  B
x  1  4
y  2  5
z  3  6
value_mapping = {
    'old_value1': 'new_value1',
    'old_value2': 'new_value2',
    'old_value3': 'new_value3',
}
df['column_name'] = df['column_name'].replace(value_mapping)
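Applying the replace pattern above to the state-abbreviation step might look like this. Only a handful of states are shown; the full mapping would need all 50 states plus DC.

```python
import pandas as pd

# Partial mapping of state names -> USPS abbreviations (illustration only;
# the real mapping needs all 50 states + DC)
state_abbrev = {
    "Alabama": "AL",
    "California": "CA",
    "New York": "NY",
    "Texas": "TX",
}

df = pd.DataFrame({"state": ["California", "Texas", "New York"]})
df["state"] = df["state"].replace(state_abbrev)
print(df["state"].tolist())  # ['CA', 'TX', 'NY']
```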
- Make each row map to a combined state + year key
import pandas as pd
df['StateYear'] = df['state'].astype(str) + df['years'].astype(str)
print(df)
- Make the different sections of data their own column
- This is probably what I need to do
df_dummies = pd.get_dummies(df, columns=['your_category_column_name'], prefix=['category'])
- Combine everything based on their new location + year index
- Maybe do this for each category first and then combine all of those
- Also maybe I can write a function that automates some of this
merged_df = pd.merge(df1, df2, on='common_col')
df = df.drop('ColumnName', axis=1)
df = df.drop(columns=['ColumnName'])
columns_to_drop = ['ColumnA', 'ColumnB', 'ColumnC']
df = df.drop(columns=columns_to_drop)
df.to_csv('output_file.csv')
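A function automating the steps above might look like this. It's only a sketch: `standardize`, `combine`, and all the column names and values are made up for illustration, and each real dataset would need its own column arguments.

```python
from functools import reduce

import pandas as pd

def standardize(df, state_col, year_col, abbrev_map=None):
    """Abbreviate state names (if a mapping is given) and build a
    state+year key so datasets can be merged on a common index."""
    out = df.copy()
    if abbrev_map:
        out[state_col] = out[state_col].replace(abbrev_map)
    out["StateYear"] = out[state_col].astype(str) + out[year_col].astype(str)
    return out.drop(columns=[state_col, year_col])

def combine(dfs):
    """Merge a list of standardized DataFrames on the shared key."""
    return reduce(lambda a, b: pd.merge(a, b, on="StateYear"), dfs)

# Toy stand-ins for two of the real datasets
gdp = standardize(
    pd.DataFrame({"state": ["CA", "TX"], "year": [2020, 2020], "gdp": [3.0, 1.9]}),
    "state", "year")
deaths = standardize(
    pd.DataFrame({"state": ["CA", "TX"], "year": [2020, 2020], "rate": [8.5, 14.2]}),
    "state", "year")

merged = combine([gdp, deaths])
print(merged)
```

Doing the per-category merges first (as noted above) and then calling `combine` on the results would fit this same pattern.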