Projects-ML/Reg-models/Supervised-ML-project.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project to understand Linear regression, Classification and Regression in Python\n",
"\n",
"We will take sklearn's housing dataset available. This dataset contains information about housing values in suburban areas, such as the median value of owner-occupied homes, along with various other attributes like crime rate, proximity to highways, and socioeconomic factors. This dataset is commonly used for regression tasks to predict housing prices based on these features.\n",
"\n",
"**Data source:** [lib.stat.cmu.edu/datasets/boston](lib.stat.cmu.edu/datasets/boston)\n",
"\n",
"\n",
" The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
" ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.\n",
"\n",
" Variables in order:\n",
"- **CRIM:** per capita crime rate by town\n",
"- **ZN:** proportion of residential land zoned for lots over 25,000 sq.ft.\n",
"- **INDUS:** proportion of non-retail business acres per town\n",
"- **CHAS:** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
"- **NOX:** nitric oxides concentration (parts per 10 million)\n",
"- **RM:** average number of rooms per dwelling\n",
"- **AGE:** proportion of owner-occupied units built prior to 1940\n",
"- **DIS:** weighted distances to five Boston employment centres\n",
"- **RAD:** index of accessibility to radial highways\n",
"- **TAX:** full-value property-tax rate per $10,000\n",
"- **PTRATIO:** pupil-teacher ratio by town\n",
"- **B:** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
"- **LSTAT:** % lower status of the population\n",
"- **MEDV:** Median value of owner-occupied homes in $1000's\n"
]
},
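{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small helper (the list name `column_names` is our own choice, not part of the dataset file), we collect the variable names above so we can label the columns once the data is parsed:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column names in the order given in the dataset description above;\n",
"# MEDV (median home value in $1000's) is the regression target.\n",
"column_names = [\n",
"    \"CRIM\", \"ZN\", \"INDUS\", \"CHAS\", \"NOX\", \"RM\", \"AGE\",\n",
"    \"DIS\", \"RAD\", \"TAX\", \"PTRATIO\", \"B\", \"LSTAT\", \"MEDV\",\n",
"]"
]
},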
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np \n",
"import matplotlib.pyplot as plt \n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data saved successfully as boston_dataset.csv\n"
]
}
],
"source": [
"import pandas as pd\n",
"import requests\n",
"\n",
"# URL of the dataset\n",
"url = \"http://lib.stat.cmu.edu/datasets/boston\"\n",
"\n",
"# Send a GET request to fetch the data\n",
"response = requests.get(url)\n",
"\n",
"# Check if the request was successful\n",
"if response.status_code == 200:\n",
" # Decode the content of the response\n",
" content = response.content.decode(\"utf-8\")\n",
" \n",
" # Save the content to a CSV file\n",
" with open(\"boston_dataset.csv\", \"w\") as f:\n",
" f.write(content)\n",
"\n",
" print(\"Data saved successfully as boston_dataset.csv\")\n",
"\n",
" # Optionally, you can also load the data into a DataFrame\n",
" # data = pd.read_csv(\"boston_dataset.csv\")\n",
"\n",
"else:\n",
" print(\"Failed to fetch data from the URL\")\n"
]
},
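{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch assuming the download above succeeded): printing the first lines of the saved file shows why a plain CSV parse fails, since the file starts with a 22-line description and each data record is split across two physical lines.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the raw file: a text header followed by two-line records.\n",
"with open(\"boston_dataset.csv\") as f:\n",
"    for i, line in enumerate(f):\n",
"        if i >= 25:\n",
"            break\n",
"        print(f\"{i:2d}: {line.rstrip()}\")"
]
},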
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"ename": "ParserError",
"evalue": "Error tokenizing data. C error: Expected 3 fields in line 3, saw 5\n",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mParserError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[10], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mboston_dataset.csv\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:948\u001b[0m, in \u001b[0;36mread_csv\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[0;32m 935\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[0;32m 936\u001b[0m dialect,\n\u001b[0;32m 937\u001b[0m delimiter,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 944\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[0;32m 945\u001b[0m )\n\u001b[0;32m 946\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[1;32m--> 948\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_read\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:617\u001b[0m, in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m 614\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n\u001b[0;32m 616\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m parser:\n\u001b[1;32m--> 617\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mparser\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnrows\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1748\u001b[0m, in \u001b[0;36mTextFileReader.read\u001b[1;34m(self, nrows)\u001b[0m\n\u001b[0;32m 1741\u001b[0m nrows \u001b[38;5;241m=\u001b[39m validate_integer(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnrows\u001b[39m\u001b[38;5;124m\"\u001b[39m, nrows)\n\u001b[0;32m 1742\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 1743\u001b[0m \u001b[38;5;66;03m# error: \"ParserBase\" has no attribute \"read\"\u001b[39;00m\n\u001b[0;32m 1744\u001b[0m (\n\u001b[0;32m 1745\u001b[0m index,\n\u001b[0;32m 1746\u001b[0m columns,\n\u001b[0;32m 1747\u001b[0m col_dict,\n\u001b[1;32m-> 1748\u001b[0m ) \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[attr-defined]\u001b[39;49;00m\n\u001b[0;32m 1749\u001b[0m \u001b[43m \u001b[49m\u001b[43mnrows\u001b[49m\n\u001b[0;32m 1750\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1751\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m:\n\u001b[0;32m 1752\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclose()\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\c_parser_wrapper.py:234\u001b[0m, in \u001b[0;36mCParserWrapper.read\u001b[1;34m(self, nrows)\u001b[0m\n\u001b[0;32m 232\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 233\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlow_memory:\n\u001b[1;32m--> 234\u001b[0m chunks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_reader\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_low_memory\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnrows\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 235\u001b[0m \u001b[38;5;66;03m# destructive to chunks\u001b[39;00m\n\u001b[0;32m 236\u001b[0m data \u001b[38;5;241m=\u001b[39m _concatenate_chunks(chunks)\n",
"File \u001b[1;32mparsers.pyx:843\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader.read_low_memory\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:904\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:879\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._tokenize_rows\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:890\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._check_tokenize_status\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:2058\u001b[0m, in \u001b[0;36mpandas._libs.parsers.raise_parser_error\u001b[1;34m()\u001b[0m\n",
"\u001b[1;31mParserError\u001b[0m: Error tokenizing data. C error: Expected 3 fields in line 3, saw 5\n"
]
}
],
"source": [
"df = pd.read_csv(\"boston_dataset.csv\")"
]
},
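{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each record spans two physical lines: the first holds 11 feature values, the second holds the remaining two features plus the target. Following the reconstruction recipe from scikit-learn's `load_boston` removal notice, we stitch the half-rows back together and label the columns with the `column_names` list defined earlier:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Even-indexed rows hold the first 11 features; odd-indexed rows hold the\n",
"# last two features and, in the third position, the target MEDV.\n",
"data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n",
"target = raw_df.values[1::2, 2]\n",
"\n",
"# Assemble a labelled DataFrame (13 features + MEDV target).\n",
"df = pd.DataFrame(data, columns=column_names[:-1])\n",
"df[\"MEDV\"] = target\n",
"df.head()"
]
},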
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"ename": "ImportError",
"evalue": "\n`load_boston` has been removed from scikit-learn since version 1.2.\n\nThe Boston housing prices dataset has an ethical problem: as\ninvestigated in [1], the authors of this dataset engineered a\nnon-invertible variable \"B\" assuming that racial self-segregation had a\npositive impact on house prices [2]. Furthermore the goal of the\nresearch that led to the creation of this dataset was to study the\nimpact of air quality but it did not give adequate demonstration of the\nvalidity of this assumption.\n\nThe scikit-learn maintainers therefore strongly discourage the use of\nthis dataset unless the purpose of the code is to study and educate\nabout ethical issues in data science and machine learning.\n\nIn this special case, you can fetch the dataset from the original\nsource::\n\n import pandas as pd\n import numpy as np\n\n data_url = \"http://lib.stat.cmu.edu/datasets/boston\"\n raw_df = pd.read_csv(data_url, sep=\"\\s+\", skiprows=22, header=None)\n data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n target = raw_df.values[1::2, 2]\n\nAlternative datasets include the California housing dataset and the\nAmes housing dataset. You can load the datasets as follows::\n\n from sklearn.datasets import fetch_california_housing\n housing = fetch_california_housing()\n\nfor the California housing dataset and::\n\n from sklearn.datasets import fetch_openml\n housing = fetch_openml(name=\"house_prices\", as_frame=True)\n\nfor the Ames housing dataset.\n\n[1] M Carlisle.\n\"Racist data destruction?\"\n<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>\n\n[2] Harrison Jr, David, and Daniel L. Rubinfeld.\n\"Hedonic housing prices and the demand for clean air.\"\nJournal of environmental economics and management 5.1 (1978): 81-102.\n<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>\n",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mImportError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[1], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mdatasets\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m load_boston\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\datasets\\__init__.py:157\u001b[0m, in \u001b[0;36m__getattr__\u001b[1;34m(name)\u001b[0m\n\u001b[0;32m 108\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m name \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mload_boston\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m 109\u001b[0m msg \u001b[38;5;241m=\u001b[39m textwrap\u001b[38;5;241m.\u001b[39mdedent(\u001b[38;5;124m\"\"\"\u001b[39m\n\u001b[0;32m 110\u001b[0m \u001b[38;5;124m `load_boston` has been removed from scikit-learn since version 1.2.\u001b[39m\n\u001b[0;32m 111\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 155\u001b[0m \u001b[38;5;124m <https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>\u001b[39m\n\u001b[0;32m 156\u001b[0m \u001b[38;5;124m \u001b[39m\u001b[38;5;124m\"\"\"\u001b[39m)\n\u001b[1;32m--> 157\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mImportError\u001b[39;00m(msg)\n\u001b[0;32m 158\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 159\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mglobals\u001b[39m()[name]\n",
"\u001b[1;31mImportError\u001b[0m: \n`load_boston` has been removed from scikit-learn since version 1.2.\n\nThe Boston housing prices dataset has an ethical problem: as\ninvestigated in [1], the authors of this dataset engineered a\nnon-invertible variable \"B\" assuming that racial self-segregation had a\npositive impact on house prices [2]. Furthermore the goal of the\nresearch that led to the creation of this dataset was to study the\nimpact of air quality but it did not give adequate demonstration of the\nvalidity of this assumption.\n\nThe scikit-learn maintainers therefore strongly discourage the use of\nthis dataset unless the purpose of the code is to study and educate\nabout ethical issues in data science and machine learning.\n\nIn this special case, you can fetch the dataset from the original\nsource::\n\n import pandas as pd\n import numpy as np\n\n data_url = \"http://lib.stat.cmu.edu/datasets/boston\"\n raw_df = pd.read_csv(data_url, sep=\"\\s+\", skiprows=22, header=None)\n data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n target = raw_df.values[1::2, 2]\n\nAlternative datasets include the California housing dataset and the\nAmes housing dataset. You can load the datasets as follows::\n\n from sklearn.datasets import fetch_california_housing\n housing = fetch_california_housing()\n\nfor the California housing dataset and::\n\n from sklearn.datasets import fetch_openml\n housing = fetch_openml(name=\"house_prices\", as_frame=True)\n\nfor the Ames housing dataset.\n\n[1] M Carlisle.\n\"Racist data destruction?\"\n<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>\n\n[2] Harrison Jr, David, and Daniel L. Rubinfeld.\n\"Hedonic housing prices and the demand for clean air.\"\nJournal of environmental economics and management 5.1 (1978): 81-102.\n<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>\n"
]
}
],
"source": [
"from sklearn.datasets import load_boston"
]
},
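{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of loading the suggested alternatives (these loaders come from scikit-learn's removal notice; `fetch_openml` downloads the Ames data from OpenML):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_california_housing, fetch_openml\n",
"\n",
"# California housing dataset (regression target: median house value).\n",
"california = fetch_california_housing(as_frame=True)\n",
"\n",
"# Ames housing dataset, fetched from OpenML.\n",
"ames = fetch_openml(name=\"house_prices\", as_frame=True)"
]
},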
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}