Projects-ML/Reg-models/Supervised-ML-project.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project to understand Linear regression, Classification and Regression in Python\n",
"\n",
"We will take sklearn's housing dataset available. This dataset contains information about housing values in suburban areas, such as the median value of owner-occupied homes, along with various other attributes like crime rate, proximity to highways, and socioeconomic factors. This dataset is commonly used for regression tasks to predict housing prices based on these features.\n",
"\n",
"**Data source:** [lib.stat.cmu.edu/datasets/boston](lib.stat.cmu.edu/datasets/boston)\n",
"\n",
"\n",
" The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
" ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.\n",
"\n",
" Variables in order:\n",
"- **CRIM:** per capita crime rate by town\n",
"- **ZN:** proportion of residential land zoned for lots over 25,000 sq.ft.\n",
"- **INDUS:** proportion of non-retail business acres per town\n",
"- **CHAS:** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
"- **NOX:** nitric oxides concentration (parts per 10 million)\n",
"- **RM:** average number of rooms per dwelling\n",
"- **AGE:** proportion of owner-occupied units built prior to 1940\n",
"- **DIS:** weighted distances to five Boston employment centres\n",
"- **RAD:** index of accessibility to radial highways\n",
"- **TAX:** full-value property-tax rate per $10,000\n",
"- **PTRATIO:** pupil-teacher ratio by town\n",
"- **B:** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
"- **LSTAT:** % lower status of the population\n",
"- **MEDV:** Median value of owner-occupied homes in $1000's\n"
]
},
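{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small helper (the list name `column_names` is our own choice, not part of the dataset file), we collect the variable names above so we can label the columns once the data is parsed:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column names in the order given in the dataset description above;\n",
"# MEDV (median home value in $1000's) is the regression target.\n",
"column_names = [\n",
"    \"CRIM\", \"ZN\", \"INDUS\", \"CHAS\", \"NOX\", \"RM\", \"AGE\",\n",
"    \"DIS\", \"RAD\", \"TAX\", \"PTRATIO\", \"B\", \"LSTAT\", \"MEDV\",\n",
"]"
]
},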
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np \n",
"import matplotlib.pyplot as plt \n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data saved successfully as boston_dataset.csv\n"
]
}
],
"source": [
"import pandas as pd\n",
"import requests\n",
"\n",
"# URL of the dataset\n",
"url = \"http://lib.stat.cmu.edu/datasets/boston\"\n",
"\n",
"# Send a GET request to fetch the data\n",
"response = requests.get(url)\n",
"\n",
"# Check if the request was successful\n",
"if response.status_code == 200:\n",
" # Decode the content of the response\n",
" content = response.content.decode(\"utf-8\")\n",
" \n",
" # Save the content to a CSV file\n",
" with open(\"boston_dataset.csv\", \"w\") as f:\n",
" f.write(content)\n",
"\n",
" print(\"Data saved successfully as boston_dataset.csv\")\n",
"\n",
" # Optionally, you can also load the data into a DataFrame\n",
" # data = pd.read_csv(\"boston_dataset.csv\")\n",
"\n",
"else:\n",
" print(\"Failed to fetch data from the URL\")\n"
]
},
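{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch assuming the download above succeeded): printing the first lines of the saved file shows why a plain CSV parse fails, since the file starts with a 22-line description and each data record is split across two physical lines.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the raw file: a text header followed by two-line records.\n",
"with open(\"boston_dataset.csv\") as f:\n",
"    for i, line in enumerate(f):\n",
"        if i >= 25:\n",
"            break\n",
"        print(f\"{i:2d}: {line.rstrip()}\")"
]
},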
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"ename": "ParserError",
"evalue": "Error tokenizing data. C error: Expected 3 fields in line 3, saw 5\n",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mParserError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[10], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mboston_dataset.csv\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:948\u001b[0m, in \u001b[0;36mread_csv\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[0;32m 935\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[0;32m 936\u001b[0m dialect,\n\u001b[0;32m 937\u001b[0m delimiter,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 944\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[0;32m 945\u001b[0m )\n\u001b[0;32m 946\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[1;32m--> 948\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_read\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:617\u001b[0m, in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m 614\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n\u001b[0;32m 616\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m parser:\n\u001b[1;32m--> 617\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mparser\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnrows\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1748\u001b[0m, in \u001b[0;36mTextFileReader.read\u001b[1;34m(self, nrows)\u001b[0m\n\u001b[0;32m 1741\u001b[0m nrows \u001b[38;5;241m=\u001b[39m validate_integer(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnrows\u001b[39m\u001b[38;5;124m\"\u001b[39m, nrows)\n\u001b[0;32m 1742\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 1743\u001b[0m \u001b[38;5;66;03m# error: \"ParserBase\" has no attribute \"read\"\u001b[39;00m\n\u001b[0;32m 1744\u001b[0m (\n\u001b[0;32m 1745\u001b[0m index,\n\u001b[0;32m 1746\u001b[0m columns,\n\u001b[0;32m 1747\u001b[0m col_dict,\n\u001b[1;32m-> 1748\u001b[0m ) \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[attr-defined]\u001b[39;49;00m\n\u001b[0;32m 1749\u001b[0m \u001b[43m \u001b[49m\u001b[43mnrows\u001b[49m\n\u001b[0;32m 1750\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1751\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m:\n\u001b[0;32m 1752\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclose()\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\io\\parsers\\c_parser_wrapper.py:234\u001b[0m, in \u001b[0;36mCParserWrapper.read\u001b[1;34m(self, nrows)\u001b[0m\n\u001b[0;32m 232\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 233\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlow_memory:\n\u001b[1;32m--> 234\u001b[0m chunks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_reader\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_low_memory\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnrows\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 235\u001b[0m \u001b[38;5;66;03m# destructive to chunks\u001b[39;00m\n\u001b[0;32m 236\u001b[0m data \u001b[38;5;241m=\u001b[39m _concatenate_chunks(chunks)\n",
"File \u001b[1;32mparsers.pyx:843\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader.read_low_memory\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:904\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:879\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._tokenize_rows\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:890\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._check_tokenize_status\u001b[1;34m()\u001b[0m\n",
"File \u001b[1;32mparsers.pyx:2058\u001b[0m, in \u001b[0;36mpandas._libs.parsers.raise_parser_error\u001b[1;34m()\u001b[0m\n",
"\u001b[1;31mParserError\u001b[0m: Error tokenizing data. C error: Expected 3 fields in line 3, saw 5\n"
]
}
],
"source": [
"df = pd.read_csv(\"boston_dataset.csv\")"
]
},
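{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each record spans two physical lines: the first holds 11 feature values, the second holds the remaining two features plus the target. Following the reconstruction recipe from scikit-learn's `load_boston` removal notice, we stitch the half-rows back together and label the columns with the `column_names` list defined earlier:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Even-indexed rows hold the first 11 features; odd-indexed rows hold the\n",
"# last two features and, in the third position, the target MEDV.\n",
"data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n",
"target = raw_df.values[1::2, 2]\n",
"\n",
"# Assemble a labelled DataFrame (13 features + MEDV target).\n",
"df = pd.DataFrame(data, columns=column_names[:-1])\n",
"df[\"MEDV\"] = target\n",
"df.head()"
]
},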
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"ename": "ImportError",
"evalue": "\n`load_boston` has been removed from scikit-learn since version 1.2.\n\nThe Boston housing prices dataset has an ethical problem: as\ninvestigated in [1], the authors of this dataset engineered a\nnon-invertible variable \"B\" assuming that racial self-segregation had a\npositive impact on house prices [2]. Furthermore the goal of the\nresearch that led to the creation of this dataset was to study the\nimpact of air quality but it did not give adequate demonstration of the\nvalidity of this assumption.\n\nThe scikit-learn maintainers therefore strongly discourage the use of\nthis dataset unless the purpose of the code is to study and educate\nabout ethical issues in data science and machine learning.\n\nIn this special case, you can fetch the dataset from the original\nsource::\n\n import pandas as pd\n import numpy as np\n\n data_url = \"http://lib.stat.cmu.edu/datasets/boston\"\n raw_df = pd.read_csv(data_url, sep=\"\\s+\", skiprows=22, header=None)\n data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n target = raw_df.values[1::2, 2]\n\nAlternative datasets include the California housing dataset and the\nAmes housing dataset. You can load the datasets as follows::\n\n from sklearn.datasets import fetch_california_housing\n housing = fetch_california_housing()\n\nfor the California housing dataset and::\n\n from sklearn.datasets import fetch_openml\n housing = fetch_openml(name=\"house_prices\", as_frame=True)\n\nfor the Ames housing dataset.\n\n[1] M Carlisle.\n\"Racist data destruction?\"\n<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>\n\n[2] Harrison Jr, David, and Daniel L. Rubinfeld.\n\"Hedonic housing prices and the demand for clean air.\"\nJournal of environmental economics and management 5.1 (1978): 81-102.\n<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>\n",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mImportError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[1], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mdatasets\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m load_boston\n",
"File \u001b[1;32mc:\\Users\\hp\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\datasets\\__init__.py:157\u001b[0m, in \u001b[0;36m__getattr__\u001b[1;34m(name)\u001b[0m\n\u001b[0;32m 108\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m name \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mload_boston\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m 109\u001b[0m msg \u001b[38;5;241m=\u001b[39m textwrap\u001b[38;5;241m.\u001b[39mdedent(\u001b[38;5;124m\"\"\"\u001b[39m\n\u001b[0;32m 110\u001b[0m \u001b[38;5;124m `load_boston` has been removed from scikit-learn since version 1.2.\u001b[39m\n\u001b[0;32m 111\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 155\u001b[0m \u001b[38;5;124m <https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>\u001b[39m\n\u001b[0;32m 156\u001b[0m \u001b[38;5;124m \u001b[39m\u001b[38;5;124m\"\"\"\u001b[39m)\n\u001b[1;32m--> 157\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mImportError\u001b[39;00m(msg)\n\u001b[0;32m 158\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 159\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mglobals\u001b[39m()[name]\n",
"\u001b[1;31mImportError\u001b[0m: \n`load_boston` has been removed from scikit-learn since version 1.2.\n\nThe Boston housing prices dataset has an ethical problem: as\ninvestigated in [1], the authors of this dataset engineered a\nnon-invertible variable \"B\" assuming that racial self-segregation had a\npositive impact on house prices [2]. Furthermore the goal of the\nresearch that led to the creation of this dataset was to study the\nimpact of air quality but it did not give adequate demonstration of the\nvalidity of this assumption.\n\nThe scikit-learn maintainers therefore strongly discourage the use of\nthis dataset unless the purpose of the code is to study and educate\nabout ethical issues in data science and machine learning.\n\nIn this special case, you can fetch the dataset from the original\nsource::\n\n import pandas as pd\n import numpy as np\n\n data_url = \"http://lib.stat.cmu.edu/datasets/boston\"\n raw_df = pd.read_csv(data_url, sep=\"\\s+\", skiprows=22, header=None)\n data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n target = raw_df.values[1::2, 2]\n\nAlternative datasets include the California housing dataset and the\nAmes housing dataset. You can load the datasets as follows::\n\n from sklearn.datasets import fetch_california_housing\n housing = fetch_california_housing()\n\nfor the California housing dataset and::\n\n from sklearn.datasets import fetch_openml\n housing = fetch_openml(name=\"house_prices\", as_frame=True)\n\nfor the Ames housing dataset.\n\n[1] M Carlisle.\n\"Racist data destruction?\"\n<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>\n\n[2] Harrison Jr, David, and Daniel L. Rubinfeld.\n\"Hedonic housing prices and the demand for clean air.\"\nJournal of environmental economics and management 5.1 (1978): 81-102.\n<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>\n"
]
}
],
"source": [
"from sklearn.datasets import load_boston"
]
},
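{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of loading the suggested alternatives (these loaders come from scikit-learn's removal notice; `fetch_openml` downloads the Ames data from OpenML):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_california_housing, fetch_openml\n",
"\n",
"# California housing dataset (regression target: median house value).\n",
"california = fetch_california_housing(as_frame=True)\n",
"\n",
"# Ames housing dataset, fetched from OpenML.\n",
"ames = fetch_openml(name=\"house_prices\", as_frame=True)"
]
},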
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}