
Working Update #1

Open
wants to merge 16 commits into master
Binary file modified __pycache__/getMatchIDs.cpython-36.pyc
Binary file not shown.
Binary file added __pycache__/getMatchIDs.cpython-37.pyc
Binary file not shown.
Binary file modified __pycache__/helper.cpython-36.pyc
Binary file not shown.
Binary file added __pycache__/helper.cpython-37.pyc
Binary file not shown.
Binary file added __pycache__/html.cpython-37.pyc
Binary file not shown.
Binary file added __pycache__/htmls.cpython-36.pyc
Binary file not shown.
Binary file added __pycache__/htmls.cpython-37.pyc
Binary file not shown.
Binary file modified __pycache__/scraper.cpython-36.pyc
Binary file not shown.
Binary file added __pycache__/scraper.cpython-37.pyc
Binary file not shown.
Binary file added __pycache__/start.cpython-37.pyc
Binary file not shown.
1,779 changes: 1 addition & 1,778 deletions csv/eventIDs.csv

Large diffs are not rendered by default.

1,783 changes: 1,783 additions & 0 deletions csv/eventIDs_1.csv

Large diffs are not rendered by default.

22,199 changes: 0 additions & 22,199 deletions csv/joinMatchEvent.csv

Large diffs are not rendered by default.

22,209 changes: 22,209 additions & 0 deletions csv/joinMatchEvent_1.csv

Large diffs are not rendered by default.

19,868 changes: 0 additions & 19,868 deletions csv/matchIDs.csv

Large diffs are not rendered by default.

19,879 changes: 19,879 additions & 0 deletions csv/matchIDs_1.csv

Large diffs are not rendered by default.

21,217 changes: 1 addition & 21,216 deletions csv/matchLineups.csv

Large diffs are not rendered by default.

21,225 changes: 21,225 additions & 0 deletions csv/matchLineups_1.csv

Large diffs are not rendered by default.

30,038 changes: 0 additions & 30,038 deletions csv/matchResults.csv

Large diffs are not rendered by default.

30,058 changes: 30,058 additions & 0 deletions csv/matchResults_1.csv

Large diffs are not rendered by default.

271,480 changes: 0 additions & 271,480 deletions csv/playerStats.csv

Large diffs are not rendered by default.

271,751 changes: 271,751 additions & 0 deletions csv/playerStats_1.csv

Large diffs are not rendered by default.

5,767 changes: 5,766 additions & 1 deletion csv/players.csv

Large diffs are not rendered by default.

2,722 changes: 2,720 additions & 2 deletions csv/teams.csv

Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions getMatchIDs.py
@@ -1,4 +1,4 @@
-from html import getHTML
+from htmls import getHTML
 import re
 
@@ -61,7 +61,6 @@ def endCheck(matchIDs, stop):
 def findMatchIDsAtURL(url):
     # Get the HTML using getHTML()
     html = getHTML(url)
-
     # Create an array of all of the Match URLs on the page
     matchIDs = re.findall('"(.*?000"><a href="/matches/.*?)"', html)
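The import change here tracks the module rename later in this PR (`html.py` → `htmls.py`): a local module named `html` shadows Python's standard-library `html` package, so the rename avoids that collision. A minimal sketch of how the renamed import and the match-ID regex fit together, assuming `htmls.py` is importable; `find_match_ids` is a hypothetical standalone wrapper, not a function in the repository:

```python
import re

from htmls import getHTML  # renamed module; "from html import getHTML" can resolve to the stdlib html package


def find_match_ids(url):
    # Hypothetical standalone version of findMatchIDsAtURL, for illustration only.
    page = getHTML(url)
    if page is None:
        return []
    # Same pattern the scraper uses to pull match links off an HLTV results page
    return re.findall('"(.*?000"><a href="/matches/.*?)"', page)


print(find_match_ids("https://www.hltv.org/results"))
```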
29 changes: 17 additions & 12 deletions helper.py
@@ -1,8 +1,8 @@
 from multiprocessing.dummy import Pool as ThreadPool
-from html import getHTML
+from htmls import getHTML
 import csv
 import sys
-
+import numpy
 
 def scrape(array, function, threads):
     # Define the number of threads
@@ -12,7 +12,10 @@ def scrape(array, function, threads):
     print("Scraping %s items using %s on %s threads." % (len(array), function, threads))
 
     # Calls get() and adds the filesize returned each call to an array called filesizes
-    result = pool.map(function, array)
+    result = list(map(function, array))
+    # print("start")
+    # print(list(result))
+    # print("end")
     pool.close()
     pool.join()
     return result
@@ -31,10 +34,10 @@ def addNewLine(file):
 def tabulate(csvFile, array):
     # Files must be in the csv directory inside the project folder
     # Opens the CSV file
-    with open("csv/%s.csv" % (csvFile), 'a', encoding='utf-8') as f:
+    with open("csv/%s.csv" % (csvFile), 'a', newline='', encoding='utf-8') as f:
         writer = csv.writer(f, delimiter=',')
         # Adds a new line if there is not one present
-        addNewLine("csv/%s.csv" % (csvFile))
+        # addNewLine("csv/%s.csv" % (csvFile))
         # Add the array passed in to the CSV file
         for i in range(0, len(array)):
             if len(array[i]) > 0:
@@ -85,14 +88,16 @@ def unDimension(array, item):
     return result
 
 
-def fixArray(array, value):
+def fixArray(array):
     # Used to clean match info results for matches with more than one map
-    for i in range(0, len(array)):
-        if len(array[i]) < value:
-            for b in range(0, len(array[i])):
-                array.append(array[i][b])
-            array.remove(array[i])
-    return array
+    newArray = []
+    for i in array:
+        if len(numpy.array(i).shape) == 2:
+            for temp in i:
+                newArray.append(temp)
+        else:
+            newArray.append(i)
+    return newArray
 
 
 def fixPlayerStats(array):
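The rewritten `fixArray` flattens any two-dimensional entries (matches that came back with one row per map) into single rows, using numpy's shape to distinguish nested rows from flat ones. A minimal sketch of its behavior, with hypothetical sample data:

```python
import numpy


def fixArray(array):
    # Flatten 2-D entries (multi-map matches) into individual rows,
    # leaving 1-D entries (single-map matches) untouched.
    newArray = []
    for i in array:
        if len(numpy.array(i).shape) == 2:
            for temp in i:
                newArray.append(temp)
        else:
            newArray.append(i)
    return newArray


mixed = [
    [["m1", "map_a", 16, 9], ["m1", "map_b", 16, 12]],  # two-map match (2-D)
    ["m2", "map_a", 16, 4],                              # single-map match (1-D)
]
print(fixArray(mixed))
# [['m1', 'map_a', 16, 9], ['m1', 'map_b', 16, 12], ['m2', 'map_a', 16, 4]]
```

Two side notes on the other hunks: `newline=''` in `tabulate` is the documented way to stop `csv.writer` from emitting blank rows on Windows, and the switch from `pool.map` to `list(map(...))` runs the scrape serially even though the thread pool is still created and joined.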
11 changes: 7 additions & 4 deletions html.py → htmls.py
@@ -1,21 +1,24 @@
 from urllib.request import Request, urlopen
 import urllib.request
 import re
+import http
 
 
 def getHTML(url):
     # Open the URL
     # Spoof the user agent
-    request = Request(url)
-    request.add_header('User-Agent', 'Mozilla/5.0')
+    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
+    req = Request(url=url, headers=headers)
     # Read the response as HTML
     try:
-        urlopen(request).read()
-        html = urlopen(request).read().decode('ascii', 'ignore')
+        html = urlopen(req).read().decode('ascii', 'ignore')
         if len(re.findall('error-desc', html)) > 0:
             return None
         else:
             return html
     except urllib.error.HTTPError as err:
         print("%s for %s" % (err.code, url))
         return None
+    except:
+        print('END POINT ERROR')
+        return None
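This change sends a full browser-like User-Agent instead of the bare `'Mozilla/5.0'`, which some sites reject, drops a redundant second `urlopen` call, and adds a bare `except` for non-HTTP failures such as connection resets. A minimal sketch of the same header-spoofing pattern in isolation, assuming an HLTV results URL:

```python
from urllib.request import Request, urlopen

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = Request(url="https://www.hltv.org/results", headers=headers)
# Decoding with errors ignored mirrors the scraper's tolerance for odd bytes
page = urlopen(req).read().decode('ascii', 'ignore')
print(page[:200])
```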
8 changes: 7 additions & 1 deletion readme.md
@@ -1,3 +1,5 @@
+## Working Update
+
 # HLTV Scraper
 
 This is a multi-threaded Python scraper designed to pull data from HLTV.org and tabulate it into a series of CSV files. It is written in pure Python, so it should run on any system that can run Python 3. It is not compatible with Python 2, so you may need to install the latest Python release from [here](https://www.python.org/downloads/).
@@ -46,4 +48,8 @@ Each match has player stats for each map. The script looks for these statistics
 
 ## Updating Players and Teams
 
-Each player and team on HLTV has a unique identification number that increases as new players are added to the database. To find new players and teams, we get the maximum identifier value from the respective `.csv` file and iterate over it using `getIterableItems`. From there the relevant pages are scraped and tabulated to `players.csv` and `teams.csv`.
+Each player and team on HLTV has a unique identification number that increases as new players are added to the database. To find new players and teams, we get the maximum identifier value from the respective `.csv` file and iterate over it using `getIterableItems`. From there the relevant pages are scraped and tabulated to `players.csv` and `teams.csv`.
+
+## Starting Over
+
+If you have already recorded a more recent event and want to scrape past it, clear `matchIDs.csv` to restart (do not remove the first row, which contains the ID and Title headers).
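For the "Updating Players and Teams" flow the readme describes, reading the maximum identifier back out of a CSV might look like the following sketch; `max_existing_id` is a hypothetical helper, not part of the repository:

```python
import csv


def max_existing_id(csv_path):
    # Hypothetical helper mirroring the readme: find the highest ID already
    # tabulated so the scraper can iterate over IDs above it.
    with open(csv_path, encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row the readme says to keep
        return max(int(row[0]) for row in reader if row)


# New players would then be the IDs above this maximum, e.g.
# for player_id in range(max_existing_id("csv/players.csv") + 1, upper_bound): ...
print(max_existing_id("csv/players.csv"))
```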
2 changes: 2 additions & 0 deletions requirements.txt
@@ -0,0 +1,2 @@
+bs4
+numpy