Releases: chris-greening/instascrape
v2.1.2
v2.1.0
New feature
instascrape.scrape_tools.scrape_posts
Takes a list of unscraped instascrape.Post
objects and scrapes them, with a variety of configuration options. Returns both the successfully scraped posts and the posts that could not be scraped.
Sample Usage
from instascrape import Post, scrape_posts
# Some code creating a list of posts and valid header info, etc...
# Scrape the first 100 posts
scraped_posts, unscraped_posts = scrape_posts(posts_list, headers=headers, limit=100)
# Scrape all posts since January 1st, 2020
import datetime
scraped_posts, unscraped_posts = scrape_posts(posts_list, headers=headers, limit=datetime.date(2020, 1, 1))
Available arguments
- posts: List[instascrape.Post]. Required, list of unscraped Post objects
- session: requests.Session. Optional, custom requests.Session object
- webdriver: selenium.webdriver.chrome.webdriver.WebDriver. Optional, custom Selenium webdriver (overrides session if passed)
- limit: Union[int, datetime.datetime]. Optional, integer or date value to stop scraping at. Defaults to all posts
- headers: dict. Optional, dictionary of request headers
- pause: int. Optional, seconds to pause between scrapes
- on_exception: str. Optional, behavior when an exception occurs; one of "raise", "pass", or "return". Defaults to "raise"
- silent: bool. Optional, suppress printed output while scraping. Defaults to True (no output)
- inplace: bool. Optional, directly modify the Post objects that are passed. Otherwise, creates copies and returns lists of those copies
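The interaction of pause and on_exception can be illustrated with a minimal, self-contained sketch of the loop a scrape_posts-style function might run internally. The scrape_with_options helper and the callable "posts" below are hypothetical stand-ins, not part of the instascrape API:

```python
import time

def scrape_with_options(posts, pause=0, on_exception="raise"):
    """Hypothetical sketch of a scrape_posts-style loop: pause seconds
    between items, on_exception controls what happens on failure."""
    scraped, unscraped = [], []
    for i, post in enumerate(posts):
        try:
            post()  # stand-in for post.scrape(); may raise
            scraped.append(post)
        except Exception:
            if on_exception == "raise":
                raise                    # propagate immediately
            elif on_exception == "pass":
                unscraped.append(post)   # record the failure, keep going
            elif on_exception == "return":
                unscraped.append(post)
                break                    # stop early, return what we have
        if pause and i < len(posts) - 1:
            time.sleep(pause)            # be polite between requests
    return scraped, unscraped

# Toy "posts": callables that succeed or fail when invoked
ok = lambda: None
bad = lambda: (_ for _ in ()).throw(ValueError)
scraped, unscraped = scrape_with_options([ok, bad, ok], on_exception="pass")
```

With on_exception="pass" the failing post is collected and scraping continues; with "return" the loop stops at the first failure and hands back both lists.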
v2.0.2
Fixes
- Fixed default None argument for instascrape.scrapers.Profile.get_posts. Passing a specific amount works, but not passing anything resulted in a comparison between NoneType and int
v2.0.0
New features
Below is a list of new features
scrape tools
- json_from_soup: Returns JSON Instagram data from BeautifulSoup
- flatten_dict: Returns a flattened dictionary of all leaf nodes in a tree of JSON data
- New flatten argument for the json_from_* functions; returns a flattened dictionary
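The idea behind flattening can be sketched with a small recursive helper; this is an illustration of the concept, not instascrape's actual flatten_dict implementation, and the key-joining scheme here is an assumption:

```python
def flatten(data, parent_key=""):
    """Collapse nested dicts into one flat dict keyed by joined paths."""
    flat = {}
    for key, value in data.items():
        new_key = f"{parent_key}_{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key))  # recurse into subtrees
        else:
            flat[new_key] = value                 # leaf node: keep as-is
    return flat

# Shape loosely modeled on Instagram's nested JSON
nested = {"graphql": {"user": {"username": "google",
                               "edge_followed_by": {"count": 12}}}}
flat = flatten(nested)
```

Every leaf ends up addressable by a single flat key (e.g. graphql_user_username), which is what makes the flattened form convenient for loading into tabular tools.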
scrapers
- New inplace argument for the scrape method
Similar to the pandas inplace parameter except the default is True as opposed to pandas's False. By default, scrape will modify an instance in place, setting attributes equal to the scraped data. If False, the current instance will remain untouched and scrape will instead return another instance with the scraped data. Useful for chaining methods
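The inplace semantics can be sketched with a toy class rather than the real scraper classes (which require a network call); ToyScraper and its hardcoded data are illustrative assumptions:

```python
import copy

class ToyScraper:
    """Toy stand-in illustrating inplace=True vs inplace=False."""
    def __init__(self, source):
        self.source = source

    def scrape(self, inplace=True):
        data = {"followers": 42}             # pretend this was fetched
        target = self if inplace else copy.copy(self)
        for attr, value in data.items():
            setattr(target, attr, value)     # load scraped data as attributes
        return None if inplace else target   # new instance enables chaining

original = ToyScraper("google")
new = original.scrape(inplace=False)
# original is untouched; new carries the scraped attributes
```

Returning a fresh instance when inplace=False is what makes chaining possible, at the cost of a copy per call.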
- New session parameter for the scrape method
Allows passing of a custom session object
- New webdriver parameter for the scrape method
Uses a webdriver for scraping the data instead of a session
Fixes
- Fixed a Post scraper KeyError that was occurring on all scrapes
Breaking changes
Below is a list of breaking changes to the library
- Renamed instascrape.scrapers.json_tools to instascrape.scrapers.scrape_tools
- Renamed the parse_json_from_mapping function to parse_data_from_json
- Removed FlatJSONDict, replaced with the flatten_dict function in scrape_tools that will flatten any dictionary
- json_from_* functions now return a list of all JSON dictionaries from the page as opposed to just the first dictionary
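The json_from_* change can be illustrated with stand-in data; fake_json_from_html below is hypothetical and only mimics the new list-valued return shape:

```python
# Before v2.0.0 the json_from_* helpers returned only the first JSON
# dict found on a page; as of v2.0.0 they return a list of every one.
def fake_json_from_html(dicts_on_page):
    """Hypothetical stand-in mimicking the new list-valued return."""
    return list(dicts_on_page)

page_data = fake_json_from_html([{"entry_data": 1}, {"extra_data": 2}])
first = page_data[0]  # the old single-dict behavior is now index [0]
```

Code migrating from 1.x that expected a single dictionary should now take element [0] of the returned list.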
Non-breaking changes behind the scenes
Below is a list of everything that changed behind the scenes that has no bearing on the API
- Refactored out a lot of complexity from instascrape.core._static_scraper._StaticHtmlScraper's implementation, greatly improving code readability
- Changed imports to reflect file moves
- Reimplemented to rely more on reusable functions as opposed to static methods unnecessarily bound to classes
- Changed how data is loaded into the namespace when using the scrape method to make room for the inplace argument. inplace defaults to True so this doesn't break any existing code but instead provides a new alternative
- Updated documentation with docstrings
v1.7.1
Deprecated data point
Removed business_email
as an available data point from the instascrape.scrapers.Profile
scraper. Instagram seems to have removed the ability to view business emails from the web version of the platform, and all values were being returned as nan. This will be explored further in the future, but for now it is being removed.
v1.7.0
Deprecations
Officially removed deprecated methods from all scrapers as listed below
- All scrapers: load instance method
- instascrape.scrapers.Hashtag: from_profile class method
- instascrape.scrapers.Post: from_shortcode class method
- instascrape.scrapers.Profile: from_username class method
The functionality of all of these methods is covered by the scrape
instance method, making them redundant and less powerful.
Documentation
- Removed misleading documentation for outdated scrapers. Improved existing scrapers
- Added and improved type hints
v1.6.1
Docs
Added type hints for better documentation
v1.6.0
New feature
Added instascrape.scrapers.IGTV
for scraping IGTV posts. instascrape.scrapers.IGTV
is a subclass of instascrape.scrapers.Post
and thus inherits all of its methods and behaviors
Sample usage:
from instascrape import IGTV
google_igtv = IGTV('https://www.instagram.com/tv/CIrIIMYl8VQ/')
google_igtv.scrape()
v1.5.0
New feature
Introduced the Reel
scraper for scraping Instagram reels. Reel
is a subclass of Post
so pretty much everything you expect from Post
is available in Reel
as well.
Sample usage:
from instascrape import Reel
sample_reel = Reel("https://www.instagram.com/reel/CIrJSrFFHM_/")
sample_reel.scrape()
Bug fixes
json_from_url
Added optional/default request headers argument to instascrape.scrapers.json_from_url
unit tests
Fixed some of the broken unit tests. The library itself was fine, but some of the tests were a little outdated and needed browser headers, which now appear to be required, to run properly.
v1.4.0
New features
Location scraper
Ability to scrape Instagram Location pages.
Sample usage
from instascrape import Location
url = "https://www.instagram.com/explore/locations/212988663/new-york-new-york/"
new_york = Location(url)
new_york.scrape()
print(f"{new_york.amount_of_posts:,} people have been to New York")
>>> 61,202,403 people have been to New York
Optional header for requests
Now supports passing an optional browser header to the scrape
method of all scraper objects. Syntax is exactly the same as a header dict
you would pass to requests.get
.
The default header is
headers={"User-Agent": "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57"}
Sample usage is
from instascrape import Profile
headers={"User-Agent": "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57"}
google = Profile("google")
google.scrape(headers=headers)
Fixes
It appears Instagram tightened restrictions overnight; all GET requests from the library were being returned 429 HTTP response status codes (Too Many Requests). Prior to now, instascrape
did not pass or have any support for passing browser headers. This new default and the option to pass in headers seems to have restored library functioning for now. Keep an eye out for more robust session handling and better cookie support in later updates