GoScrapy requires Go version 1.22 or higher.
GoScrapy provides the goscrapy CLI tool to help you scaffold a GoScrapy project.
Usage
- Install
go install github.com/tech-engine/goscrapy@latest
- Verify installation
goscrapy -v
- Create a project
goscrapy startproject scrapejsp
- Create a custom pipeline
goscrapy pipeline export_2_DB
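Running startproject scaffolds a new project directory. The file names below are inferred from the files referenced throughout this page, so the exact layout may differ by version:

scrapejsp/
├── main.go      // entrypoint that starts the spider
├── settings.go  // middlewares, pipelines and tuning constants
├── job.go       // the Job input type
├── output.go    // the Record output type
└── spider.go    // the spider's scraping logic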
GoScrapy is built around three core concepts:
- Job: Describes an input to your spider.
- Record: Represents an output produced by your spider.
- Spider: Contains the main logic of your scraper.
A Job represents an input to a GoScrapy spider and must implement the core.IJob interface.
type IJob interface {
    Id() string
}
type Job struct {
    id string
    // add your own fields here
}
func (j *Job) Id() string {
    return j.id
}
A Record represents an output produced by a spider (via Yield) and must implement the core.IOutput interface.
type IOutput interface {
    Record() *Record
    RecordKeys() []string
    RecordFlat() []any
    Job() IJob
}
type Record struct {
    J *Job `json:"-" csv:"-"`
}
func (r *Record) Record() *Record {
    return r
}
func (r *Record) RecordKeys() []string {
    ....
    keys := make([]string, numFields)
    ....
    return keys
}
func (r *Record) RecordFlat() []any {
    ....
    return slice
}
func (r *Record) Job() core.IJob {
    return r.J
}
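In practice the Record carries your scraped fields, and RecordKeys/RecordFlat walk those fields to produce column names and values for the pipelines. As a sketch, a Record for a product scraper might look like this; the Title and Price fields and their tags are assumptions for illustration:

type Record struct {
    J     *Job   `json:"-" csv:"-"`
    Title string `json:"title" csv:"title"` // hypothetical scraped field
    Price string `json:"price" csv:"price"` // hypothetical scraped field
}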
A Spider encapsulates the main logic of a GoScrapy scraper. We embed gos.ICoreSpider to make our spider work.
type Spider struct {
    gos.ICoreSpider[*Record]
}
func New(ctx context.Context) (*Spider, <-chan error) {
    // use proxies
    // proxies := core.WithProxies("proxy_url1", "proxy_url2", ...)
    // core := gos.New[*Record]().WithClient(
    //     gos.DefaultClient(proxies),
    // )
    core := gos.New[*Record]()
    // Add middlewares
    core.MiddlewareManager.Add(MIDDLEWARES...)
    // Add pipelines
    core.PipelineManager.Add(PIPELINES...)
    errCh := make(chan error)
    go func() {
        errCh <- core.Start(ctx)
    }()
    return &Spider{
        core,
    }, errCh
}
// StartRequest is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
    // call NewRequest() for every request; never reuse a request
    req := s.NewRequest()
    var headers http.Header
    /* GET is the request method; method chaining is possible
    req.Url("<URL_HERE>").
        Meta("MY_KEY1", "MY_VALUE").
        Meta("MY_KEY2", true).
        Header(headers)
    */
    /* POST
    req.Url(<URL_HERE>)
    req.Method("POST")
    req.Body(<BODY_HERE>)
    */
    // dispatch the request; parse will handle the response
    s.Request(req, s.parse)
}
// Close can be used for cleanup when the spider exits
func (s *Spider) Close(ctx context.Context) {
}
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // resp.Body()
    // resp.StatusCode()
    // resp.Header()
    // resp.Bytes()
    // resp.Meta("MY_KEY1")
    // yielding pushes the output to the pipelines; see output.go for the Record fields
    var data Record
    err := json.Unmarshal(resp.Bytes(), &data)
    if err != nil {
        log.Panicln(err)
    }
    // s.Yield(&data)
}
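Putting the pieces together, a filled-in StartRequest for a JSON API might look like the sketch below; the URL and meta key are placeholders, not part of the scaffold:

func (s *Spider) StartRequest(ctx context.Context, job *Job) {
    req := s.NewRequest()
    // GET a placeholder JSON endpoint, tagging the request with metadata
    req.Url("https://example.com/api/products").
        Meta("PAGE", 1)
    // parse receives the response once the request completes
    s.Request(req, s.parse)
}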
In addition to the files discussed above, the scaffold also generates settings.go, where we register all the middlewares and pipelines we want to use in our project.
// HTTP Transport settings
// Default: 10000
const MIDDLEWARE_HTTP_TIMEOUT_MS = ""
// Default: 100
const MIDDLEWARE_HTTP_MAX_IDLE_CONN = ""
// Default: 100
const MIDDLEWARE_HTTP_MAX_CONN_PER_HOST = ""
// Default: 100
const MIDDLEWARE_HTTP_MAX_IDLE_CONN_PER_HOST = ""
// Inbuilt Retry middleware settings
// Default: 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = ""
// Default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_CODES = ""
// Default: 1s
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = ""
// Default: 1000000
const SCHEDULER_REQ_RES_POOL_SIZE = ""
// Default: num. of CPU * 3
const SCHEDULER_CONCURRENCY = ""
// Default: 1000000
const SCHEDULER_WORK_QUEUE_SIZE = ""
// Pipeline Manager settings
// Default: 10000
const PIPELINEMANAGER_ITEMPOOL_SIZE = ""
// Default: 24
const PIPELINEMANAGER_ITEM_SIZE = ""
// Default: 0
const PIPELINEMANAGER_OUTPUT_QUEUE_BUF_SIZE = ""
// Default: 1000
const PIPELINEMANAGER_MAX_PROCESS_ITEM_CONCURRENCY = ""
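These constants hold string values, and an empty value appears to fall back to the default noted in each comment. To override one, edit the existing constant in settings.go; the values below are illustrative:

// 30 second HTTP timeout instead of the default 10000 ms
const MIDDLEWARE_HTTP_TIMEOUT_MS = "30000"
// 5 retries instead of the default 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = "5"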
// Middlewares here
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(),
    middlewares.MultiCookieJar,
    middlewares.DupeFilter,
}
var export2CSV = pipelines.Export2CSV[*Record](pipelines.Export2CSVOpts{
    Filename: "itstimeitsnowornever.csv",
})
// Pipelines here
var PIPELINES = []pm.IPipeline[*Record]{
    export2CSV,
    // export2Json,
}
...
- scrapejsp - API scraping
- scrapejsp_method2 (this newer method is recommended) - API scraping
- books.toscrape.com - HTML scraping
More examples coming...
func main() {
    ctx, cancel := context.WithCancel(context.Background())
    var wg sync.WaitGroup
    wg.Add(1)
    spider, errCh := test1.New(ctx)
    go func() {
        defer wg.Done()
        err := <-errCh
        // a nil error or context.Canceled means a clean shutdown
        if err == nil || errors.Is(err, context.Canceled) {
            return
        }
        fmt.Printf("failed: %q\n", err)
    }()
    // start the scraper with a job; nil is passed here, but you can pass your own job
    spider.StartRequest(ctx, nil)
    OnTerminate(func() {
        fmt.Println("exit signal received: shutting down gracefully")
        cancel()
        wg.Wait()
    })
}
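OnTerminate blocks until an exit signal arrives and then runs the callback. The scaffolded main.go provides it; a minimal sketch of an equivalent, assuming POSIX-style signals (imports os, os/signal and syscall), looks like this:

func OnTerminate(fn func()) {
    // buffer one signal so we never miss a fast Ctrl+C
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
    <-sigCh
    if fn != nil {
        fn()
    }
}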
Customize the default client.
Option | Description | Default |
---|---|---|
WithProxies | Accepts multiple proxy URL strings. | By default the client uses the proxy from the environment. |
WithTimeout | HTTP client timeout. | 10 seconds |
WithMaxIdleConns | Controls the max number of idle (keep-alive) connections across all hosts. 0 means unlimited. | 100 |
WithMaxIdleConnsPerHost | Same as WithMaxIdleConns, but per host. | 100 |
WithMaxConnsPerHost | Limits the total number of connections per host. 0 means unlimited. | 100 |
WithProxyFn | Accepts a custom proxy function for the transport. | Round robin |
[spider.go]
func New(ctx context.Context) (*Spider, <-chan error) {
    // default client options
    // proxies := gos.WithProxies("proxy_url1", "proxy_url2", ...)
    // core := gos.New[*Record]().WithClient(
    //     gos.DefaultClient(proxies),
    // )
    // we can also provide our own custom client
    // core := gos.New[*Record]().WithClient(myCustomHTTPClient)
    // ... the rest is identical to the New function shown earlier
}
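For illustration, a custom client could be built like the sketch below. Whether WithClient accepts a *http.Client directly is an assumption here, so check the gos package for the exact type it expects:

// a hypothetical custom client with a longer timeout and tuned transport
var myCustomHTTPClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:    200,
        MaxConnsPerHost: 50,
    },
}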
Pipelines help in managing, transforming, and fine-tuning the scraped data.
We can add pipelines using core.PipelineManager.Add().
[settings.go]
// use the export-to-CSV pipeline
export2Csv := pipelines.Export2CSV[*scrapejsp.Record](pipelines.Export2CSVOpts{
    Filename: "itstimeitsnowornever.csv",
})
// use the export-to-JSON pipeline
export2Json := pipelines.Export2JSON[*scrapejsp.Record](pipelines.Export2JSONOpts{
    Filename:  "itstimeitsnowornever.json",
    Immediate: true,
})
A Group allows us to execute multiple pipelines concurrently; all pipelines in a group behave as one single pipeline. This is useful when we want to export our data to multiple destinations: instead of exporting sequentially, we can bundle the exports together in a group.
Pipelines in a group shouldn't be used for data transformation, but for independent tasks such as exporting data to a database.
[settings.go]
func myCustomPipelineGroup() *pm.Group[*Record] {
    // create a group
    pipelineGroup := pm.NewGroup[*Record]()
    pipelineGroup.Add(export2CSV)
    // pipelineGroup.Add(export2Json)
    return pipelineGroup
}
// Pipelines here
// Executed in the order they appear.
var PIPELINES = []pm.IPipeline[*Record]{
    export2CSV,
    // export2Json,
    // myCustomPipelineGroup(), // use a group as if it were a single pipeline
}
GoScrapy also supports inbuilt and custom middlewares for manipulating outgoing requests.
- MultiCookieJar - maintains separate cookie sessions while scraping.
- DupeFilter - filters out duplicate requests.
- Retry - retries a request with exponential back-off upon failure or on HTTP status codes 500, 502, 503, 504, 522, 524, 408, 429.
Option | Description | Default |
---|---|---|
MaxRetries | Number of additional retries after a failure. | 3 |
Codes | HTTP status codes that trigger a retry. | 500, 502, 503, 504, 522, 524, 408, 429 |
BaseDelay | Exponential back-off multiplier. | 1 second |
Cb | Callback executed after every retry. If the callback returns false, further retries are skipped. | nil |
We can add middlewares using core.MiddlewareManager.Add().
[settings.go]
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(),
    middlewares.MultiCookieJar,
    middlewares.DupeFilter,
}
GoScrapy supports custom pipelines. To create one, you can use goscrapy cli.
abc\go\go-test-scrapy\scrapejsp> goscrapy pipeline export_2_DB
✔️ pipelines\export_2_DB.go
✨ Congrates, export_2_DB created successfully.
Custom middlewares must have the following function signature.
func MultiCookieJar(next http.RoundTripper) http.RoundTripper {
    return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
        // your custom middleware code here
        return next.RoundTrip(req)
    })
}
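As a concrete sketch, here is a custom middleware that stamps a User-Agent header on every outgoing request; the middleware name and header value are illustrative, not part of GoScrapy:

func UserAgent(next http.RoundTripper) http.RoundTripper {
    return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
        // set the header before handing the request to the next round tripper
        req.Header.Set("User-Agent", "my-goscrapy-bot/1.0")
        return next.RoundTrip(req)
    })
}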
GoScrapy supports CSS and XPath selectors.
[spider.go]
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // CSS selector - select all product anchor tags and extract the href attribute values
    var productUrls []string
    productUrls = resp.Css("article.product_pod h3 a").Attr("href")
    // select all the text node values
    var productNames []string
    productNames = resp.Css("article.product_pod h3 a").Text()
    // selector chaining is possible too
    productUrls = resp.Css("article.product_pod").Css("h3 a").Attr("href")
    // XPath selector
    productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]//h3//a").Attr("href")
    // chaining XPath and CSS is also possible
    productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]").Css("h3 a").Attr("href")
    // get all matching nodes
    var productUrlNodes []*html.Node
    productUrlNodes = resp.Css("article.product_pod h3 a").GetAll()
    // get the first matching node
    var firstProductUrlNode *html.Node
    firstProductUrlNode = resp.Css("article.product_pod h3 a").Get()
}
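To turn the extracted values into pipeline output, the slices can be zipped into Records and yielded from inside parse. The Title and Url fields below are hypothetical and must match whatever fields your Record actually defines:

for i, name := range productNames {
    if i >= len(productUrls) {
        break // guard against unequal slice lengths
    }
    // Title and Url are assumed Record fields for this sketch
    s.Yield(&Record{Title: name, Url: productUrls[i]})
}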