shamiraty/OPEN-STREAMLIT-PROJECTS

BUSINESS INTELLIGENCE: KPI, TRENDS & PREDICTION

AUTOMATED INFERENTIAL & DESCRIPTIVE STATISTICS

DATA-DRIVEN WEB APPLICATION WITH PYTHON AND STREAMLIT

Live Demo

YouTube

Watch the video

TARGET AUDIENCE

Serial No. Targeted Audience
1 Scientific Research
2 Health & Medical Laboratories
3 Data Scientists
4 Educational Institutions
5 Statisticians
6 Natural Mathematics Researchers
7 Information System Analysts
8 Software Developers
9 Finance & Accounting Professionals
10 Machine Learning Engineers
11 Market Analysts
12 Economists
13 Business Intelligence Analysts
14 Operations Researchers
15 Environmental Scientists
16 Policy Analysts
17 Social Science Researchers
18 Clinical Data Managers
19 Actuaries
20 Product Managers

INTRODUCTION

My name is Sameer, a data scientist and software developer. I have designed this application as a foundation for others to solve data-related problems across fields such as scientific research, finance, and health. Through this application, you can learn how data is collected, cleaned, analyzed, and interpreted to derive meaningful insights.

PROBLEM STATEMENT

Many organizations and researchers struggle to analyze vast amounts of data efficiently. Traditional methods can be time-consuming and often require specialized knowledge in both statistics and programming. The challenge is to provide an accessible tool that leverages data science and statistical techniques to automate data analysis tasks for various applications, including predictive modeling, trend analysis, and hypothesis testing.

MAIN OBJECTIVE

The main objective of this project is to provide knowledge about statistical and machine learning models, demonstrating how scientific computing and programming can be used to automate complex analysis tasks. By making these techniques accessible, users can enhance their decision-making processes and generate insights more effectively.

METHODOLOGY

  1. Data Collection: Data is gathered from relevant sources depending on the field of application (e.g., health records, financial data, survey data).
  2. Data Cleaning: The data undergoes preprocessing to handle missing values, correct inaccuracies, and transform data types for accurate analysis.
  3. Data Analysis: Using descriptive and inferential statistical methods, key patterns and trends are identified within the dataset.
  4. Model Development: Machine learning models, including regression and classification models, are developed to predict outcomes and identify patterns.
  5. Visualization: Interactive visualizations such as histograms, ogives, and scatter plots help in the intuitive understanding of results.
  6. Interpretation: Insights are derived from the results, helping users make data-driven decisions relevant to their field of interest.

PROJECT FEATURES:

  1. CO-VARIANCE: Measure of the joint variability of two variables.
  2. ADVANCED MULTIVARIATE REGRESSION: Regression techniques involving multiple predictors and response variables.
  3. TRENDS BY GEO-REFERENCING: Analyze data trends based on geographic information.
  4. DESCRIPTIVE STATISTICS ANALYTICS: Summary and analysis of data with central tendency, dispersion, etc.
  5. MULTIPLE REGRESSION ANALYSIS: Model the relationship between one dependent variable and multiple independent variables.
  6. SALES TRENDS BY DATE RANGE: Analyze sales patterns over a specified time period.
  7. BUSINESS TARGET BY PROGRESS: Evaluate business performance relative to targets.
  8. INTERACTIVE VISUALIZATION GRAPHS: Dynamic and user-interactive data visualizations.
  9. STATISTICS FOR GROUPED DATA: Statistical analysis where data is organized into groups or intervals.
  10. STATISTICS FOR UNGROUPED DATA: Statistical analysis of raw, ungrouped data values.
  11. ADVANCED PYTHON QUERY: Techniques for complex data querying using Python.
  12. OUTLIER DETECTION TECHNIQUES: Methods for identifying abnormal data points in datasets.
  13. HYPOTHESIS TESTING: Statistical method to test assumptions or claims about a population.
  14. FREQUENCY DISTRIBUTION: Representation of data showing the number of observations within intervals.
  15. NORMAL DISTRIBUTIONS: Bell-shaped distribution that is symmetrical about the mean.
  16. PROBABILITY DISTRIBUTIONS: Function that shows the likelihood of different outcomes in an experiment.
  17. LOGISTIC REGRESSION: Model to estimate probabilities and model binary outcomes.
  18. ESTIMATION OF POPULATION: Inference of population parameters based on sample data.
  19. PROBABILITY DENSITY: Function describing the likelihood of a continuous random variable's outcome.

PROJECT PAGES

PAGE 1: DESCRIPTIVE STATISTICS FOR GROUPED DATA

1. Data Loading

  • Data Source: Loads dataset from a CSV file for analysis.

2. Age Interval Calculation

  • Purpose: Creates age intervals (e.g., 0-10, 11-20) and labels for categorizing age data into discrete groups.

3. Frequency Table Creation

  • Purpose: Generates a table counting occurrences within each age interval, facilitating grouped data analysis.

4. Grouped Statistical Calculations

  • Purpose: Calculates essential statistics for grouped data, aiding in understanding data distribution:
    • Mean: Computes the weighted average midpoint of age intervals.
    • Mode: Identifies the most frequent age interval.
    • Median: Locates the interval that contains the middle observation, using the cumulative frequency.
    • Variance and Standard Deviation: Measures the spread of data points around the mean.
    • Skewness and Kurtosis: Assesses the asymmetry and tail weight of the data distribution.
    • Interquartile Range (IQR): Calculates the spread between the first and third quartiles.
    • Standard Error: Measures the precision of the sample mean.
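
The grouped calculations above can be sketched in a few lines. This is a minimal illustration, not the application's own code: the function name `grouped_stats`, the class boundaries, and the frequencies are all hypothetical, and the median uses the usual linear interpolation inside the median class.

```python
import math

def grouped_stats(intervals, freqs):
    """Summary statistics from a frequency table.

    intervals: list of (lower, upper) class boundaries (hypothetical input shape)
    freqs:     number of observations in each class
    """
    n = sum(freqs)
    mids = [(lo + hi) / 2 for lo, hi in intervals]

    # Mean: frequency-weighted average of the class midpoints.
    mean = sum(f * m for f, m in zip(freqs, mids)) / n

    # Sample variance around that mean, again using midpoints.
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / (n - 1)

    # Median: interpolate inside the first class whose cumulative
    # frequency reaches n / 2.
    cumulative = 0
    median = None
    for (lo, hi), f in zip(intervals, freqs):
        if cumulative + f >= n / 2:
            median = lo + (n / 2 - cumulative) / f * (hi - lo)
            break
        cumulative += f

    return {
        "n": n,
        "mean": mean,
        "median": median,
        "modal_class": intervals[freqs.index(max(freqs))],
        "variance": variance,
        "std": math.sqrt(variance),
        "std_error": math.sqrt(variance / n),
    }

# Made-up age table: 0-10, 10-20, 20-30, 30-40
stats = grouped_stats([(0, 10), (10, 20), (20, 30), (30, 40)], [2, 5, 8, 5])
print(stats["mean"], stats["median"], stats["modal_class"])
```

Dividing by n - 1 treats the table as a sample; divide by n instead if the table describes the whole population.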

5. Metric Display in Streamlit

  • Purpose: Displays key grouped data statistics (mean, median, mode, etc.) to the user in an interactive dashboard.

6. Skewness Visualization

  • Purpose: Plots a normal distribution curve to visualize data symmetry, with an annotation for skewness, allowing for a visual assessment of data distribution shape.

7. Frequency Table Display

  • Purpose: Presents a frequency table with cumulative frequencies, providing insights into data distribution across age intervals.

PAGE 2: DESCRIPTIVE STATISTICS & DATA VISUALIZATION

  1. Data Loading and Selection

    • Loads data from an Excel file (data.xlsx) and uses it for analytical processing.
    • Allows users to filter data by Region, Location, and Construction fields for customized analysis.
  2. Descriptive Analytics

    • Computes key summary statistics such as Sum, Mode, Mean, and Median for the Investment column.
    • Displays these metrics in the Streamlit interface for easy visualization.
  3. Data Visualization

    • Histograms: Visualizes the frequency distribution of variables in the dataset.
    • Bar Chart: Shows investments by BusinessType, providing a breakdown of investments by type.
    • Line Chart: Visualizes investments by State, showing trends across different states.
    • Pie Chart: Represents Ratings by Region, showing the proportion of ratings for each region.
  4. Target Tracking and Progress Bar

    • Defines a target for investment and calculates the current percentage toward this target.
    • Provides a progress bar to visually represent how close the current investment is to the target.
  5. Quartile Analysis

    • Uses a box plot to analyze the distribution of Investment by BusinessType, displaying quartiles and helping identify outliers.
  6. User Interface with Interactive Elements

    • Includes an interactive sidebar with options to navigate between different views (Home, Progress).
    • Enables selection of quantitative features for exploring distributions and trends.

PAGE 3: HYPOTHESIS TESTING

  • Data Loading and Cleaning:

    • Reads data from an Excel file (hypothesis.xlsx).
    • Drops unnecessary columns to focus on relevant fields for hypothesis testing.
  • Hypothesis Formulation:

    • Defines null and alternative hypotheses for comparing the mean revenues of Group A and Group B.
  • Confidence Level Setup:

    • Sets a 95% confidence level, i.e. a significance level of α = 0.05, for the test.
  • T-Test for Independent Samples:

    • Conducts a t-test to compare means of two independent groups (Group A and Group B).
    • Calculates t-statistic and p-value for hypothesis evaluation.
  • Sample Statistics Calculation:

    • Computes and displays sample mean and standard deviation for both groups.
    • Checks the sample size and applies the t-distribution only when the sample is smaller than 30.
  • Critical Value Determination:

    • Calculates the critical value based on confidence level and sample size.
  • T-Distribution Curve Generation:

    • Generates a probability density curve for visualizing the t-distribution.
  • Decision-Making:

    • Compares computed t-statistic with critical value to decide whether to reject the null hypothesis.
  • Visualization of Results:

    • Displays t-distribution curve with annotated critical value, t-statistic, and rejection region.
    • Uses visual aids (vertical lines, filled regions) to highlight decision boundaries and critical regions.
  • Summary Metrics Display:

    • Shows computed values and critical values in a dashboard format.
    • Presents the sample size and statistical metrics in a well-organized layout using Streamlit components.
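
As a rough illustration of the decision step, the sketch below computes a pooled-variance t-statistic for two independent samples using only the standard library. It is not taken from the application: the function name, the revenue figures, and the choice of the equal-variance form are assumptions, and the critical value is passed in from a t-table rather than computed.

```python
import math
from statistics import mean, variance

def pooled_t_test(group_a, group_b, t_critical):
    """Two-sample t-test assuming equal variances (an assumption of this sketch).

    t_critical is the two-tailed critical value for the chosen alpha and
    degrees of freedom, looked up from a t-table by the caller.
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2

    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df

    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))

    t_stat = (mean(group_a) - mean(group_b)) / std_err
    reject_null = abs(t_stat) > t_critical
    return t_stat, df, reject_null

# Made-up revenues for Group A and Group B.
group_a = [10, 12, 14, 16, 18]
group_b = [11, 13, 15, 17, 19]

# 2.306 is the two-tailed critical value for df = 8 at alpha = 0.05.
t_stat, df, reject_null = pooled_t_test(group_a, group_b, t_critical=2.306)
print(t_stat, df, reject_null)
```

Here |t| is well below the critical value, so the null hypothesis of equal means is not rejected.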

PAGE 4: ADVANCED LINEAR REGRESSION

1. Data Loading and Selection

  • Data Source: Loaded from CSV file (advanced_regression.csv).
  • Feature Columns: interest_rate, unemployment_rate, index_price.
  • Filtering: Data is filtered based on user-selected year and month.

2. Exploratory Data Analysis (EDA)

Correlation Analysis:

  • Used sns.regplot to visually explore relationships between features.
  • Calculated and displayed correlation matrix for the variables.

Visualizing Relationships:

  • Regression plots show the relationships between interest_rate and unemployment_rate, interest_rate and index_price.
  • Box plots detect outliers in the dataset.

Variable Distributions:

  • Displayed histograms for variable frequency distributions.
  • Used sns.pairplot to examine pairwise relationships.

3. Handling Missing Data

  • Checked for missing values and displayed the count of NaN entries in each column.
  • Provided descriptive statistics (mean, standard deviation, etc.) for each variable.

4. Data Preprocessing

Splitting the Data:

  • Split the data into training and testing sets using train_test_split.

Standardization:

  • Applied standardization using StandardScaler to scale features.

5. Modeling

Multiple Linear Regression Model:

  • Built a linear regression model using LinearRegression from sklearn.
  • Used cross-validation to evaluate model performance.

Prediction:

  • Predicted the target variable (index_price) on the test dataset.

6. Model Evaluation

Performance Metrics:

  • Calculated and displayed Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

R² and Adjusted R²:

  • Calculated and displayed the R² and Adjusted R² values for model performance.

Residuals Analysis:

  • Computed residuals and visualized them using a normal distribution curve to check the error distribution.
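
The evaluation metrics listed in this section follow directly from the residuals. The sketch below shows the arithmetic in plain Python; the function name and the sample values are illustrative only, and the application may obtain the same numbers from library helpers instead.

```python
import math

def regression_metrics(y_true, y_pred, n_predictors):
    """MSE, MAE, RMSE, R² and adjusted R² from actual and predicted values."""
    n = len(y_true)
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)            # unexplained variation
    y_bar = sum(y_true) / n
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)  # total variation

    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R² penalizes every extra predictor.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

    return {
        "mse": mse,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(mse),
        "r2": r2,
        "adj_r2": adj_r2,
    }

# Made-up index_price values and predictions from a two-predictor model.
metrics = regression_metrics([3, 5, 7, 9, 11], [2, 5, 8, 9, 11], n_predictors=2)
print(metrics)
```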

7. Statistical Analysis

  • Used OLS (Ordinary Least Squares) regression from statsmodels to obtain detailed model insights, including coefficients and p-values.

PAGE 5: CO-VARIANCE

  • Data Loading: Loads data from an Excel file, allowing for further statistical operations and visualizations.

  • Feature Selection: Provides a feature selection for X variable, enabling dynamic analysis of various numerical features against the target variable.

  • Statistical Model Fitting: Fits an Ordinary Least Squares (OLS) regression model to examine the relationship between the selected X feature and the target variable (Projects).

  • Key Statistical Metrics Calculation:

    • Intercept: Displays the intercept term of the model, representing the baseline effect on Projects.
    • R-Squared: Shows the R-squared value, providing insight into the model's explanatory power.
    • Adjusted R-Squared: Adjusts for the number of predictors to gauge model fit accuracy.
    • Standard Error: Provides the standard error, indicating the precision of the intercept estimate.
  • Predictions and Residuals Calculation: Calculates model predictions and residuals for further analysis.

  • Data Visualization:

    • Line of Best Fit Plot: Generates a scatter plot with a line of best fit to visualize the relationship between the selected X feature and Projects, assessing the model fit visually.
    • Grid and Border Customization: Customizes plot appearance for better interpretability.
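
To connect the page title with the fitted line: for a single predictor, the OLS slope is the covariance of X and Y divided by the variance of X. The sketch below shows that relationship with made-up numbers. The helper names are hypothetical, and the application itself fits the model with an OLS routine rather than by hand.

```python
def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x for one predictor."""
    slope = covariance(x, y) / covariance(x, x)  # cov(x, x) is the variance of x
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return intercept, slope

# Made-up data: a selected X feature and the Projects target.
x = [1, 2, 3, 4, 5]
projects = [3, 5, 7, 9, 11]

intercept, slope = fit_line(x, projects)
predictions = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(projects, predictions)]
print(covariance(x, projects), intercept, slope)
```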

PAGE 6: DESCRIPTIVE STATISTICS FOR UNGROUPED DATA

  1. Data Loading

    • Load dataset from a CSV file for analysis.
  2. Quartile and IQR Calculation

    • Calculate the 1st Quartile (Q1), 3rd Quartile (Q3), and Interquartile Range (IQR) for understanding the spread of the dataset.
  3. Basic Statistics Computation

    • Determine minimum, maximum, and median values to summarize the dataset's range and central tendency.
  4. Ogives Plotting

    • Generate Less Than and Greater Than Ogives to visualize cumulative frequency distribution.
    • Add a vertical line and annotation for the median value to highlight central tendency in the plot.
  5. Display Statistics in Streamlit Dashboard

    • Display quartiles, IQR, min, max, and median values in an interactive layout for user insights.
    • Apply styling to metrics for improved readability and visual appeal.
  6. Interactive Visualization

    • Present the ogives plot in Streamlit to allow for intuitive data exploration.
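
The quartile and ogive steps above can be sketched with the standard library alone. The data and function names below are made up for illustration, and the quartile method ("inclusive", i.e. linear interpolation between data points) is an assumption; other conventions give slightly different Q1 and Q3 values.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(data):
    """Min, quartiles, max and IQR of raw (ungrouped) values."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": min(data), "q1": q1, "median": q2,
            "q3": q3, "max": max(data), "iqr": q3 - q1}

def ogives(data):
    """Cumulative counts for the 'less than' and 'greater than' ogives.

    Returns two lists of (value, count) pairs: observations <= value
    and observations >= value.
    """
    counts = Counter(data)
    n = len(data)
    less_than, greater_than = [], []
    running = 0
    for value in sorted(counts):
        greater_than.append((value, n - running))
        running += counts[value]
        less_than.append((value, running))
    return less_than, greater_than

data = [2, 4, 4, 5, 7, 8, 9, 10, 12]
summary = five_number_summary(data)
less_than, greater_than = ogives(data)
print(summary)
```

Plotted against the values, the two ogives rise and fall in opposite directions and intersect near the median.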

PAGE 7: DATA VISUALIZATION TECHNIQUES

  1. Data Loading and Selection

    • Loads data from an Excel file (data.xlsx) and uses it for analytical processing.
    • Allows users to filter data by Region, Location, and Construction fields for customized analysis.
  2. Descriptive Analytics

    • Computes key summary statistics such as Sum, Mode, Mean, and Median for the Investment column.
    • Displays these metrics in the Streamlit interface for easy visualization.
  3. Data Visualization

    • Histograms: Visualizes the frequency distribution of variables in the dataset.
    • Bar Chart: Shows investments by BusinessType, providing a breakdown of investments by type.
    • Line Chart: Visualizes investments by State, showing trends across different states.
    • Pie Chart: Represents Ratings by Region, showing the proportion of ratings for each region.
  4. Target Tracking and Progress Bar

    • Defines a target for investment and calculates the current percentage toward this target.
    • Provides a progress bar to visually represent how close the current investment is to the target.
  5. Quartile Analysis

    • Uses a box plot to analyze the distribution of Investment by BusinessType, displaying quartiles and helping identify outliers.
  6. User Interface with Interactive Elements

    • Includes an interactive sidebar with options to navigate between different views (Home, Progress).
    • Enables selection of quantitative features for exploring distributions and trends.

PAGE 8: LINEAR REGRESSION

1. Data Loading and Preprocessing:

  • The dashboard loads an Excel dataset (regression.xlsx) containing information on Dependant, Wives, and Projects.
  • Extracts the independent variables (Dependant and Wives) and the dependent variable (Projects) for use in regression analysis.

2. Model Fitting and Prediction:

  • A Linear Regression model is trained on the dataset using Dependant and Wives to predict the Projects (dependent variable).
  • Predictions are made using the trained model and stored for further analysis.

3. Regression Coefficients:

  • The Intercept (B0) and Coefficients (B1, B2) for the independent variables are calculated and displayed. These define the fitted linear relationship between the predictors and the dependent variable.

4. Model Evaluation Metrics:

  • R-squared (R²): Measures the proportion of variance in the dependent variable explained by the independent variables.
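  • Adjusted R-squared: Adjusts R² for the number of predictors in the model, so that adding predictors which do not improve the fit lowers the score.
  • Sum of Squared Errors (SSE): The sum of squared differences between the actual and predicted values.
  • Regression Sum of Squares (SSR): The sum of squared differences between the predicted values and the mean of the actual values, i.e. the variation explained by the model.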

5. Prediction Table:

  • Displays a table with the actual and predicted Projects (Y) values, along with the SSE and SSR values for each data point.

6. Residual Analysis:

  • Residuals: The difference between the actual and predicted values of Projects is calculated.
  • A scatter plot of the residuals versus the predicted values is displayed to visualize model fit.
  • A Kernel Density Estimation (KDE) plot of the residuals is shown to analyze their distribution.

7. User Input and Prediction:

  • Users can input new values for Dependant and Wives in a sidebar form.
  • Upon submission, the model predicts the number of Projects for the provided inputs and displays the result.

8. Download Option:

  • The user can download the dataset with the actual values, predicted values, SSE, and SSR as a CSV file.

9. Visualizations:

  • Regression Line and Scatter Plot: Visualizes the relationship between actual and predicted values, including the best fit line.
  • Residual Plot: Shows the distribution of residuals using a KDE plot.

PAGE 9: NORMAL DISTRIBUTION

1. Data Collection

  • The application uses an Excel file (normal_distr.xlsx) to load the dataset which contains student marks.

2. Data Preprocessing

  • The data is cleaned by extracting the 'Marks' column for analysis.
  • A slider is created for users to select an X value from the data range (min, max, mean).

3. Statistical Calculations

  • Mean & Standard Deviation:
    The application calculates the population mean and standard deviation of the marks.
  • Z-Score Calculation:
    The Z-score is calculated using the formula:
    Z = (X - Mean) / Standard Deviation, where X is the user-selected value.
  • Probability Calculation:
    The cumulative distribution function (CDF) for the Z-score is computed using the normal distribution.
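
As a small worked example of the Z-score and probability steps, the sketch below uses Python's `statistics.NormalDist`. The marks are invented and the variable names are illustrative; the application may use a different library for the CDF.

```python
from statistics import NormalDist, mean, pstdev

marks = [40, 50, 60, 70, 80]   # made-up student marks

mu = mean(marks)               # population mean
sigma = pstdev(marks)          # population standard deviation

def z_score(x, mu, sigma):
    """Z = (X - Mean) / Standard Deviation."""
    return (x - mu) / sigma

def probability_below(x, mu, sigma):
    """P(X <= x) from the standard normal CDF of the Z-score."""
    return NormalDist().cdf(z_score(x, mu, sigma))

x = 60                         # value a user might pick with the slider
print(z_score(x, mu, sigma), probability_below(x, mu, sigma))
```

A mark equal to the mean gives Z = 0 and a cumulative probability of 0.5; a mark one standard deviation above the mean gives roughly 0.84.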

4. Visualizations

  • Standard Normal Distribution Curve:
    A line plot of the standard normal distribution (Z ~ N(0, 1)) is generated using Plotly.
    • Red marker indicates the selected Z-score value.
    • The shaded area on the graph represents the probability for the selected Z-score value.
  • Standardized Marks Distribution:
    A plot shows the probability distribution of standardized marks.
  • Probability of Selected X:
    Another plot shows the probability density associated with the selected X value.

5. Standardization of Data

  • The application standardizes the marks (i.e., converts the marks into Z-scores) for comparison across datasets.
  • The standardized marks are added as a new column in the dataset.

6. Z-Table

  • A Z-table is generated which maps Z-scores to their corresponding cumulative probabilities.
  • The table allows the user to quickly reference the probability associated with different Z-scores.

7. Interactive Elements

  • Filters:
    The user can filter the data using a multiselect dropdown for columns such as "fullname", "gender", "Marks", "Probability", and "Standardized Marks".
  • PDF Download:
    The Z-table can be downloaded as a PDF file for further use or offline reference.

8. User Interaction

  • The sidebar allows the user to interact with the X value slider and see the corresponding changes in the graph and statistics.
  • Various interactive graphs display the probability distributions and Z-score information dynamically.

9. Statistical Insight

  • The application offers insights such as the probability of the selected X value, the Z-score, and the standard deviation, helping users understand the statistical significance of their data.

10. Output Display

  • The output is displayed in a structured layout with expandable sections for viewing different analyses:
    • Estimation Parameters
    • Normal Curves
    • Standardized Student Marks Table
    • Z Table

PAGE 10: ESTIMATION OF POPULATION PARAMETERS

Overview

This page estimates population parameters from a sample dataset of ages. The analysis calculates sample statistics and confidence intervals for the population mean and standard deviation. The key steps and results are presented below, with visualizations to support the statistical concepts.

Key Data Science and Statistical Concepts Used

1. Loading and Processing Data

The data is loaded from a CSV file, and the age column is extracted for statistical analysis.

2. Sample Statistics

  • Sample Size (n): The number of entries in the age column.
  • Sample Mean: The average age in the sample.
  • Sample Standard Deviation: The measure of variability in the sample.

3. Population Estimation

  • Population Size (N): The total number of individuals in the population (set to 1000 in this case).
  • Confidence Level (95%): The proportion of intervals, built this way from repeated samples, that would contain the true parameter.

4. Confidence Intervals

  • Population Mean Confidence Interval: A range within which the true population mean is likely to lie, calculated using the sample mean and sample standard deviation.
  • Population Standard Deviation Confidence Interval: A range within which the true population standard deviation is likely to lie, calculated from the sample variance using the chi-square distribution.

5. Standard Error of the Mean (SEM)

This metric is used to estimate the precision of the sample mean as an estimate of the population mean.

6. Critical Z-Value

We calculate the critical z-value for a 95% confidence level using the standard normal distribution, which helps in defining the range of values for the confidence interval.
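
The standard error, critical z-value, and confidence interval for the mean fit together as in the sketch below. It is a minimal illustration with invented ages and a hypothetical function name, and it ignores the population size N (no finite population correction), which may differ from what the application does.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """z-based confidence interval for the population mean."""
    n = len(sample)
    sample_mean = mean(sample)
    sem = stdev(sample) / math.sqrt(n)            # standard error of the mean

    # Two-tailed critical value, about 1.96 for 95% confidence.
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    margin = z_critical * sem
    return {
        "mean": sample_mean,
        "sem": sem,
        "z_critical": z_critical,
        "lower": sample_mean - margin,
        "upper": sample_mean + margin,
    }

ages = [23, 25, 31, 35, 40, 42, 28, 33, 37, 30]   # made-up sample
interval = mean_confidence_interval(ages)
print(interval)
```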

7. Normal Distribution Curve

The normal distribution curve is plotted to represent the probability density of the sample mean. A shaded region is used to represent the 95% confidence interval for the population mean.

8. Plotly Visualization

  • A normal distribution curve is plotted using Plotly.
  • The 95% confidence interval is shaded under the curve to visualize the area within which the population mean is expected to lie.
  • Markers are added to highlight the sample mean and the confidence interval bounds.

PAGE 11: SALES ANALYTICS { CASE STUDY }

1. Data Import and Processing

  • Dataset Loading: A CSV file (sales.csv) is read into a pandas DataFrame for analysis.
  • Date Filtering: Users can filter the dataset by a date range (start and end dates). The data is filtered based on the OrderDate column to display relevant sales data.
  • Data Exploration: A DataFrame explorer is used to interactively view and filter the dataset, making it easier for users to explore the data.

2. Descriptive Analytics

  • Metrics Calculation:
    • Total Products in Inventory: Count of Product entries to display the number of inventory items.
    • Total Price Sum: The sum of all TotalPrice values is displayed to give an overall view of sales revenue.
    • Price Range Analysis:
      • Maximum and minimum price for products are calculated and displayed.
      • Price range (difference between the maximum and minimum prices) is calculated.
  • These metrics provide key insights into inventory and sales data.
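
The date filter and the metrics above amount to a few aggregations over the selected rows. The sketch below shows them on an in-memory list using only the standard library. The rows are invented, the function name is hypothetical, and the use of TotalPrice for the price range is an assumption; the application works on a pandas DataFrame loaded from sales.csv.

```python
from datetime import date

# Made-up rows using the column names mentioned above.
rows = [
    {"OrderDate": date(2024, 1, 5),  "Product": "A", "TotalPrice": 120.0},
    {"OrderDate": date(2024, 1, 20), "Product": "B", "TotalPrice": 80.0},
    {"OrderDate": date(2024, 2, 10), "Product": "C", "TotalPrice": 200.0},
    {"OrderDate": date(2024, 3, 1),  "Product": "D", "TotalPrice": 50.0},
]

def sales_metrics(rows, start, end):
    """Count, revenue, and price range for orders within [start, end]."""
    selected = [r for r in rows if start <= r["OrderDate"] <= end]
    prices = [r["TotalPrice"] for r in selected]
    if not prices:
        # Nothing falls inside the chosen date range.
        return {"count": 0, "total": 0.0, "max": None, "min": None, "range": None}
    return {
        "count": len(selected),
        "total": sum(prices),
        "max": max(prices),
        "min": min(prices),
        "range": max(prices) - min(prices),
    }

summary = sales_metrics(rows, date(2024, 1, 1), date(2024, 2, 28))
print(summary)
```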

3. Data Visualization

  • Dot Plot: A scatter plot is used to visualize the relationship between Product and TotalPrice. Each point represents a product with its corresponding total price, and products are color-coded by their category.
  • Bar Graph: A bar chart is used to display the relationship between Product and UnitPrice. The chart aggregates UnitPrice over months to show trends in pricing.
  • Scatter Plot: A scatter plot is created based on user-selected features. It visualizes relationships between categorical (qualitative) data (feature_x) and numerical (quantitative) data (feature_y).
  • Bar Chart of Quantities: A bar chart visualizes the total quantity sold for each product, helping to analyze product demand.

4. Interactive User Interface

  • Date Range Selection: Users can select a date range from the sidebar, allowing them to filter sales data dynamically.
  • Feature Selection: Users can select features for the x and y axes to explore relationships in the data through scatter plots.
  • Data Table: The filtered dataset is displayed interactively for further analysis.

5. Statistical and Business Insights

  • Price Range Insights: The metrics calculated (maximum, minimum, range) help users identify high-value and low-value products, which is critical for pricing strategies.
  • Sales Trend Analysis: The dot plot and bar charts help identify trends in product sales, such as which products have higher sales and which products are more expensive.
  • Business Metrics: The overall revenue and inventory metrics provide insights into the health of the business and help with decision-making.

CONCLUSION

This page is focused on descriptive analytics and basic statistics. The main tasks involve:

  • Data cleaning and filtering.
  • Displaying key business metrics related to product pricing and sales volume.
  • Visualizing the relationship between various features such as product prices and quantities.
  • Providing interactive tools for users to explore the dataset and extract insights.

OTHER VIDEO SERIES:

Data Science Web Development Job Recommendation System

  1. Part 1: Introduction
  2. Part 2: Admin Theme
  3. Part 3: Model Training and Prediction
  4. Part 4: View, URL, and Template Rendering
  5. Part 5: How to Generate 10000 Fake Dataset CSV
  6. Episode 6: Project Overview

OTHER STREAMLIT VIDEO SERIES

  1. Business Analytics Web Dashboard

  2. Analytics Website Dashboard

  3. Logistic Multiple Regression Analytics Web

  4. Normal Probability Distribution Analytics Web

  5. Python: Query Operations

  6. Python: Binomial Probability Distributions

  7. Hypothesis Testing T Distribution Curve

  8. Frequency Distribution Table

  9. Geo Referencing Business Trends

  10. Multiple Linear Regression Web Project

  11. Python: Web Dashboard: DashPlotly Framework and Dash

  12. Python: Web Dashboard using DashPlotly Framework

  13. Python: Multiple Linear Regression

  14. Logistic Regression Analysis

  15. PygWalker Graph Creator

  16. Sales Analytics Web Dashboard

  17. Analytics Dashboard with MySQL

  18. Business Intelligent Analytics Web Dashboard

  19. Descriptive Analytics Web Dashboard 1

  20. Descriptive Analytics Web Dashboard 2

  21. Analytics Dashboard Website with Graphs 3

  22. Add new Record to Excel file via Web Interface

  23. CrossTabulation Web App

Contact Information

WhatsApp

  • +255675839840
  • +255656848274

YouTube

YouTube Channel

Telegram

  • +255656848274
  • +255738144353

PlayStore

PlayStore Developer Page

GitHub

GitHub Profile
