Go Web Crawler - Xie Xianbin's (谢先斌) Blog

Go Web Crawler

Published: 2022-04-15 Updated: 2025-10-19 Word count: 332 Reading time: 1m Author: Xie Xianbin (谢先斌) IP: Shanghai

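Web crawlers in Go are commonly built on the goquery package, which parses the HTML returned by an HTTP request and queries it with jQuery-style selectors. The request itself is made with the standard net/http package (goquery's own NewDocument(url) helper is deprecated).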

Introduction

What is a crawler

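A crawler is a program that automatically retrieves information from the Internet according to a set of rules. Crawlers are typically used in search engines, web scanners, and similar tools.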

HTML basics

About goquery

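The goquery package takes the HTML returned by an HTTP request and parses it into a document that can be queried.

Loading a document from a response body (or any other reader): func NewDocumentFromReader(r io.Reader) (*Document, error)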
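Parsing
- Find: find descendant elements that match a selector
  - document.Find("a")
- ChildrenFiltered: find direct child elements that match a selector
- Text: get the text content
- Html: get the inner HTML
- Attr: get an attribute value
- Each: iterate over the matched elements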
Selectors, modeled on jQuery
- HTML tag: tag
- CSS class: .class
- CSS id: #id
- Compound selectors that combine the above
  - div.class
  - span#id
- Descendant selector
  - selector1 selector2 selector3 …
- Child selector: selector1 > selector2 > …
  - document.Find("selector1").ChildrenFiltered("selector2")

Example

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

// run fetches a GitHub profile page and prints the names of its pinned repositories.
func run() error {
	req, err := http.NewRequest("GET", "https://github.com/xiexianbin", nil)
	if err != nil {
		return err
	}

	// Only real request headers are set here. The :authority, :method, :path and
	// :scheme entries shown in browser dev tools are HTTP/2 pseudo-headers that
	// net/http derives from the URL, so they are not set by hand.
	req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36")
	//req.Header.Set("cookie", "")

	client := &http.Client{Transport: http.DefaultTransport}

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Stop early on error pages instead of parsing them.
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}

	document, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return err
	}

	// These selectors depend on GitHub's page markup and may need updating if it changes.
	document.Find(".js-pinned-items-reorder-container .js-pinned-items-reorder-list .flex-content-stretch").Each(func(i int, selection *goquery.Selection) {
		fmt.Println("---", i, "---")
		fmt.Println(selection.Find("span.repo").Text())
	})
	return nil
}