GoScrapy requires Go version 1.22 or higher.
GoScrapy provides the goscrapy CLI tool to help you scaffold a GoScrapy project.
Usage
- Install
go install github.com/tech-engine/goscrapy@latest
- Verify installation
goscrapy -v
- Create a project
goscrapy startproject scrapejsp
- Create a custom pipeline
goscrapy pipeline export_2_DB
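Running startproject scaffolds a new project directory. The file names below are inferred from the files referenced throughout this page, so the exact layout may differ by version:

scrapejsp/
├── main.go      // entrypoint that starts the spider
├── settings.go  // middlewares, pipelines and tuning constants
├── job.go       // the Job input type
├── output.go    // the Record output type
└── spider.go    // the spider's scraping logic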
GoScrapy is built around three core concepts:
- Job: Describes an input to your spider.
- Record: Represents an output produced by your spider.
- Spider: Contains the main logic of your scraper.
A Job represents an input to a GoScrapy spider and must implement the core.IJob interface.
type IJob interface {
    Id() string
}
type Job struct {
    id string
    // add your own fields here
}
func (j *Job) Id() string {
    return j.id
}
A Record represents an output produced by a spider (via Yield) and must implement the core.IOutput interface.
type IOutput interface {
    Record() *Record
    RecordKeys() []string
    RecordFlat() []any
    Job() IJob
}
type Record struct {
    J *Job `json:"-" csv:"-"`
}
func (r *Record) Record() *Record {
    return r
}
func (r *Record) RecordKeys() []string {
    ....
    keys := make([]string, numFields)
    ....
    return keys
}
func (r *Record) RecordFlat() []any {
    ....
    return slice
}
func (r *Record) Job() core.IJob {
    return r.J
}
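In practice the Record carries your scraped fields, and RecordKeys/RecordFlat walk those fields to produce column names and values for the pipelines. As a sketch, a Record for a product scraper might look like this; the Title and Price fields and their tags are assumptions for illustration:

type Record struct {
    J     *Job   `json:"-" csv:"-"`
    Title string `json:"title" csv:"title"` // hypothetical scraped field
    Price string `json:"price" csv:"price"` // hypothetical scraped field
}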
A Spider encapsulates the main logic of a GoScrapy scraper. We embed gos.ICoreSpider to make our spider work.
type Spider struct {
    gos.ICoreSpider[*Record]
}
func New(ctx context.Context) (*Spider, <-chan error) {
    // use proxies
    // proxies := core.WithProxies("proxy_url1", "proxy_url2", ...)
    // core := gos.New[*Record]().WithClient(
    //     gos.DefaultClient(proxies),
    // )
    core := gos.New[*Record]()
    // Add middlewares
    core.MiddlewareManager.Add(MIDDLEWARES...)
    // Add pipelines
    core.PipelineManager.Add(PIPELINES...)
    errCh := make(chan error)
    go func() {
        errCh <- core.Start(ctx)
    }()
    return &Spider{
        core,
    }, errCh
}
// StartRequest is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
    // call NewRequest() for every request; never reuse a request
    req := s.NewRequest()
    var headers http.Header
    /* GET is the request method; method chaining is possible
    req.Url("<URL_HERE>").
        Meta("MY_KEY1", "MY_VALUE").
        Meta("MY_KEY2", true).
        Header(headers)
    */
    /* POST
    req.Url(<URL_HERE>)
    req.Method("POST")
    req.Body(<BODY_HERE>)
    */
    // dispatch the request; parse will handle the response
    s.Request(req, s.parse)
}
// Close can be used for cleanup when the spider exits
func (s *Spider) Close(ctx context.Context) {
}
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // resp.Body()
    // resp.StatusCode()
    // resp.Header()
    // resp.Bytes()
    // resp.Meta("MY_KEY1")
    // yielding pushes the output to the pipelines; see output.go for the Record fields
    var data Record
    err := json.Unmarshal(resp.Bytes(), &data)
    if err != nil {
        log.Panicln(err)
    }
    // s.Yield(&data)
}
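Putting the pieces together, a filled-in StartRequest for a JSON API might look like the sketch below; the URL and meta key are placeholders, not part of the scaffold:

func (s *Spider) StartRequest(ctx context.Context, job *Job) {
    req := s.NewRequest()
    // GET a placeholder JSON endpoint, tagging the request with metadata
    req.Url("https://example.com/api/products").
        Meta("PAGE", 1)
    // parse receives the response once the request completes
    s.Request(req, s.parse)
}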
In addition to the files discussed above, the scaffold also generates settings.go, where we register all the middlewares and pipelines we want to use in our project.
// HTTP Transport settings
// Default: 10000
const MIDDLEWARE_HTTP_TIMEOUT_MS = ""
// Default: 100
const MIDDLEWARE_HTTP_MAX_IDLE_CONN = ""
// Default: 100
const MIDDLEWARE_HTTP_MAX_CONN_PER_HOST = ""
// Default: 100
const MIDDLEWARE_HTTP_MAX_IDLE_CONN_PER_HOST = ""
// Inbuilt Retry middleware settings
// Default: 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = ""
// Default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_CODES = ""
// Default: 1s
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = ""
// Default: 1000000
const SCHEDULER_REQ_RES_POOL_SIZE = ""
// Default: num. of CPU * 3
const SCHEDULER_CONCURRENCY = ""
// Default: 1000000
const SCHEDULER_WORK_QUEUE_SIZE = ""
// Pipeline Manager settings
// Default: 10000
const PIPELINEMANAGER_ITEMPOOL_SIZE = ""
// Default: 24
const PIPELINEMANAGER_ITEM_SIZE = ""
// Default: 0
const PIPELINEMANAGER_OUTPUT_QUEUE_BUF_SIZE = ""
// Default: 1000
const PIPELINEMANAGER_MAX_PROCESS_ITEM_CONCURRENCY = ""
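These constants hold string values, and an empty value appears to fall back to the default noted in each comment. To override one, edit the existing constant in settings.go; the values below are illustrative:

// 30 second HTTP timeout instead of the default 10000 ms
const MIDDLEWARE_HTTP_TIMEOUT_MS = "30000"
// 5 retries instead of the default 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = "5"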
// Middlewares here
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(),
    middlewares.MultiCookieJar,
    middlewares.DupeFilter,
}
var export2CSV = pipelines.Export2CSV[*Record](pipelines.Export2CSVOpts{
    Filename: "itstimeitsnowornever.csv",
})
// Pipelines here
var PIPELINES = []pm.IPipeline[*Record]{
    export2CSV,
    // export2Json,
}
...
- scrapejsp - API scraping
- scrapejsp_method2 (this newer method is recommended) - API scraping
- books.toscrape.com - HTML scraping
More examples coming...
func main() {
    ctx, cancel := context.WithCancel(context.Background())
    var wg sync.WaitGroup
    wg.Add(1)
    spider, errCh := test1.New(ctx)
    go func() {
        defer wg.Done()
        err := <-errCh
        // a nil error or context.Canceled means a clean shutdown
        if err == nil || errors.Is(err, context.Canceled) {
            return
        }
        fmt.Printf("failed: %q\n", err)
    }()
    // start the scraper with a job; nil is passed here, but you can pass your own job
    spider.StartRequest(ctx, nil)
    OnTerminate(func() {
        fmt.Println("exit signal received: shutting down gracefully")
        cancel()
        wg.Wait()
    })
}
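OnTerminate blocks until an exit signal arrives and then runs the callback. The scaffolded main.go provides it; a minimal sketch of an equivalent, assuming POSIX-style signals (imports os, os/signal and syscall), looks like this:

func OnTerminate(fn func()) {
    // buffer one signal so we never miss a fast Ctrl+C
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
    <-sigCh
    if fn != nil {
        fn()
    }
}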
Customize the default client.
Option | Description | Default |
---|---|---|
WithProxies | Accepts multiple proxy URL strings. | By default the client uses the proxy from the environment. |
WithTimeout | HTTP client timeout. | 10 seconds |
WithMaxIdleConns | Controls the max number of idle (keep-alive) connections across all hosts. 0 means unlimited. | 100 |
WithMaxIdleConnsPerHost | Same as WithMaxIdleConns, but per host. | 100 |
WithMaxConnsPerHost | Limits the total number of connections per host. 0 means unlimited. | 100 |
WithProxyFn | Accepts a custom proxy function for the transport. | Round robin |
[spider.go]
func New(ctx context.Context) (*Spider, <-chan error) {
    // default client options
    // proxies := gos.WithProxies("proxy_url1", "proxy_url2", ...)
    // core := gos.New[*Record]().WithClient(
    //     gos.DefaultClient(proxies),
    // )
    // we can also provide our own custom client
    // core := gos.New[*Record]().WithClient(myCustomHTTPClient)
    // ... the rest is identical to the New function shown earlier
}
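For illustration, a custom client could be built like the sketch below. Whether WithClient accepts a *http.Client directly is an assumption here, so check the gos package for the exact type it expects:

// a hypothetical custom client with a longer timeout and tuned transport
var myCustomHTTPClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:    200,
        MaxConnsPerHost: 50,
    },
}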
Pipelines help in managing, transforming, and fine-tuning the scraped data.
We can add pipelines using core.PipelineManager.Add().
[settings.go]
// use the export-to-CSV pipeline
export2Csv := pipelines.Export2CSV[*scrapejsp.Record](pipelines.Export2CSVOpts{
    Filename: "itstimeitsnowornever.csv",
})
// use the export-to-JSON pipeline
export2Json := pipelines.Export2JSON[*scrapejsp.Record](pipelines.Export2JSONOpts{
    Filename:  "itstimeitsnowornever.json",
    Immediate: true,
})
A Group allows us to execute multiple pipelines concurrently; all pipelines in a group behave as one single pipeline. This is useful when we want to export our data to multiple destinations: instead of exporting sequentially, we can bundle the exports together in a group.
Pipelines in a group shouldn't be used for data transformation, but for independent tasks such as exporting data to a database.
[settings.go]
func myCustomPipelineGroup() *pm.Group[*Record] {
    // create a group
    pipelineGroup := pm.NewGroup[*Record]()
    pipelineGroup.Add(export2CSV)
    // pipelineGroup.Add(export2Json)
    return pipelineGroup
}
// Pipelines here
// Executed in the order they appear.
var PIPELINES = []pm.IPipeline[*Record]{
    export2CSV,
    // export2Json,
    // myCustomPipelineGroup(), // use a group as if it were a single pipeline
}
GoScrapy also supports inbuilt and custom middlewares for manipulating outgoing requests.
- MultiCookieJar - maintains separate cookie sessions while scraping.
- DupeFilter - filters out duplicate requests.
- Retry - retries a request with exponential back-off upon failure or on HTTP status codes 500, 502, 503, 504, 522, 524, 408, 429.
Option | Description | Default |
---|---|---|
MaxRetries | Number of additional retries after a failure. | 3 |
Codes | HTTP status codes that trigger a retry. | 500, 502, 503, 504, 522, 524, 408, 429 |
BaseDelay | Exponential back-off multiplier. | 1 second |
Cb | Callback executed after every retry. If the callback returns false, further retries are skipped. | nil |
We can add middlewares using core.MiddlewareManager.Add().
[settings.go]
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(),
    middlewares.MultiCookieJar,
    middlewares.DupeFilter,
}
GoScrapy supports custom pipelines. To create one, you can use goscrapy cli.
abc\go\go-test-scrapy\scrapejsp> goscrapy pipeline export_2_DB
✔️ pipelines\export_2_DB.go
✨ Congrates, export_2_DB created successfully.
Custom middlewares must have the following function signature.
func MultiCookieJar(next http.RoundTripper) http.RoundTripper {
    return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
        // your custom middleware code here
        return next.RoundTrip(req)
    })
}
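As a concrete sketch, here is a custom middleware that stamps a User-Agent header on every outgoing request; the middleware name and header value are illustrative, not part of GoScrapy:

func UserAgent(next http.RoundTripper) http.RoundTripper {
    return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
        // set the header before handing the request to the next round tripper
        req.Header.Set("User-Agent", "my-goscrapy-bot/1.0")
        return next.RoundTrip(req)
    })
}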
GoScrapy supports CSS and XPath selectors.
[spider.go]
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // CSS selector - select all product anchor tags and extract the href attribute values
    var productUrls []string
    productUrls = resp.Css("article.product_pod h3 a").Attr("href")
    // select all the text node values
    var productNames []string
    productNames = resp.Css("article.product_pod h3 a").Text()
    // selector chaining is possible too
    productUrls = resp.Css("article.product_pod").Css("h3 a").Attr("href")
    // XPath selector
    productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]//h3//a").Attr("href")
    // chaining XPath and CSS is also possible
    productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]").Css("h3 a").Attr("href")
    // get all matching nodes
    var productUrlNodes []*html.Node
    productUrlNodes = resp.Css("article.product_pod h3 a").GetAll()
    // get the first matching node
    var firstProductUrlNode *html.Node
    firstProductUrlNode = resp.Css("article.product_pod h3 a").Get()
}
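To turn the extracted values into pipeline output, the slices can be zipped into Records and yielded from inside parse. The Title and Url fields below are hypothetical and must match whatever fields your Record actually defines:

for i, name := range productNames {
    if i >= len(productUrls) {
        break // guard against unequal slice lengths
    }
    // Title and Url are assumed Record fields for this sketch
    s.Yield(&Record{Title: name, Url: productUrls[i]})
}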