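Go Development: Using XPath to Extract the Content You Want from a Web Page

1. Preamble

In earlier articles I described how to scrape web page content with XPath, but always with Python as the language. How do you use XPath from Go? This article walks through it with a small example: scraping a few pieces of content from github.com with Go.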

2. A Look at the Result First

For example, scrape the text of these items on github.com:
[Image]
The content obtained after scraping:
[Image]

3. Required Packages

Install the packages the project needs before anything else:

https://github.com/pkg/errors  
https://github.com/lestrrat-go/libxml2 

4. Core Code

/**
Author: 沙振宇
CreateTime: 2019-12-4
UpdateTime: 2019-12-4
Info:	"Read HTML content with XPath"
		"Uses GitHub to demonstrate how XPath is used"
 */
package main

import (
	"fmt"
	"github.com/lestrrat-go/libxml2"
	"github.com/lestrrat-go/libxml2/xpath"
	"net/http"
	"strings"
)

// getReplace removes every space and newline character from str.
func getReplace(str string) string {
	// Remove spaces
	str = strings.Replace(str, " ", "", -1)
	// Remove newlines
	str = strings.Replace(str, "\n", "", -1)
	return str
}

func main() {
	urlPath := "https://github.com/"
	res, err := http.Get(urlPath)
	if err != nil {
		panic("failed to get: " + err.Error())
	}
	// Close the response body once the program is done with it
	defer res.Body.Close()

	doc, err := libxml2.ParseHTMLReader(res.Body)
	if err != nil {
		panic("failed to parse HTML: " + err.Error())
	}
	defer doc.Free()

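	// Select the text nodes of the header menu <summary> elements.
	// The class attribute has to match the page's markup exactly.
	nodes := xpath.NodeList(doc.Find(`//summary[@class="HeaderMenu-summary HeaderMenu-link px-0 py-3 border-0 no-wrap d-block d-lg-inline-block"]/text()`))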

	fmt.Printf("nodes type: %T,len: %d\n\n", nodes, len(nodes))

	for i := 0; i < len(nodes); i++ {
		fmt.Printf("nodes type: %T,text: %s\n", nodes[i], getReplace(nodes[i].String()))
	}
}
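
One detail of getReplace in the listing above is worth seeing in isolation: because it deletes every space, a label made of several words comes out glued together. The sketch below compares that behavior with a hypothetical alternative, collapseSpace, which trims the text and collapses runs of whitespace to a single space. The sample string is made up for illustration and is not taken from the real page.

```go
package main

import (
	"fmt"
	"strings"
)

// stripAll mirrors getReplace from the listing: it deletes every space
// and every newline character.
func stripAll(s string) string {
	s = strings.ReplaceAll(s, " ", "")
	return strings.ReplaceAll(s, "\n", "")
}

// collapseSpace is a hypothetical alternative. strings.Fields splits on any
// run of whitespace (spaces, tabs, newlines), and Join puts a single space
// back between the words.
func collapseSpace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// Made-up sample resembling an indented text node.
	raw := "\n          Why GitHub?\n          "

	fmt.Printf("%q\n", stripAll(raw))      // "WhyGitHub?"
	fmt.Printf("%q\n", collapseSpace(raw)) // "Why GitHub?"
}
```

Which variant is preferable depends on whether the spaces inside a label matter for the output.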

5. Source Code on GitHub

https://github.com/ShaShiDiZhuanLan/Demo_Xpath_Go

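6. Other Tips

6.1. Rolling Back Code with git

While uploading code to GitHub with git, I ran into a problem: I wanted to delete some unnecessary commits. I reset the code to an earlier commit and then made a new commit. The unwanted commits were gone from the local repository, but a normal push was rejected every time, because the remote branch still contained the commits that had been dropped locally. Only a forced push with git push -f succeeded. Note that a forced push overwrites the history of the remote branch.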

# Note: n is the number of commits to roll back
git reset --hard HEAD~n

# Force push
git push -f -u origin <branch-name>

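6.2. Slow Access to GitHub

Access to github.com is sometimes slow, so I used http://tool.chinaz.com/dns to look up the IP address with the fastest DNS response:
[Image]
Add this IP to the hosts file:
on Windows, edit C:\Windows\System32\drivers\etc\hosts;
on Linux, edit /etc/hosts.

For the result above, appending the following line to the end of the hosts file is enough: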

52.74.223.119 github.com 
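
Separately from the hosts entry, the scraper itself can be made more tolerant of a slow or failing github.com. The core code calls http.Get, which uses Go's default client and has no timeout. The following is a minimal sketch, not the article's code: fetchHTML and newDemoServer are hypothetical helpers, and the 10-second timeout is an arbitrary illustrative value. The local test server only exists so that the sketch runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// client is shared by all requests. The timeout covers the whole exchange,
// including reading the body. 10 seconds is an assumed value.
var client = &http.Client{Timeout: 10 * time.Second}

// fetchHTML (hypothetical helper) downloads url and returns the body as a
// string. It fails on transport errors, timeouts, and non-200 responses.
func fetchHTML(url string) (string, error) {
	res, err := client.Get(url)
	if err != nil {
		return "", fmt.Errorf("get %s: %w", url, err)
	}
	defer res.Body.Close()

	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("get %s: unexpected status %s", url, res.Status)
	}
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return "", fmt.Errorf("read %s: %w", url, err)
	}
	return string(body), nil
}

// newDemoServer starts a local test server: /ok returns a small page,
// every other path returns 404.
func newDemoServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html>ok</html>")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	body, err := fetchHTML(srv.URL + "/ok")
	fmt.Println(body, err)

	_, err = fetchHTML(srv.URL + "/missing")
	fmt.Println(err)
}
```

In the scraper, the returned string could be passed to the HTML parser in place of reading res.Body directly, so an error page is never parsed as if it were the expected document.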

I am parsing an XML document using gopkg.in/xmlpath.v2 (http://gopkg.in/xmlpath.v2), and I have run into trouble. I have no problem getting info from a single node, or getting an iterator and looping over its items to get their info. The case where I am blocked is when I try to get the info from the same node on which I am iterating. I think that an example will be illuminating.
This is the XML:

<Warnings>
	<Warning Type="309" ShortText="Unfulfilled Paid Service 1">Unable to book seat 1</Warning>
	<Warning Type="309" ShortText="Unfulfilled Paid Service 2">Unable to book seat 2</Warning>
	<Warning Type="309" ShortText="Unfulfilled Paid Service 3">Unable to book seat 3</Warning>
</Warnings>

These are the XPath expressions that I am using:

xpath := xPathWarning{
	WarningsBase: "Warnings/Warning",
	Warning:      "",
	WarningAttr:  "@ShortText",
}

And this is the way that I am trying to get the value and the attribute:

func getWarnings(root *xmlpath.Node, xpath xpath_OC) []Warning {
	warnings := []Warning{}
	v, _ := xmlpath.Compile(xpath.WarningsBase)
	WarningsBaseIter := v.Iter(root)
	for WarningsBaseIter.Next() {
		rawOffer := WarningsBaseIter.Node()
		warning := Warning{
			Value: GetString("", xpath.Warning, rawOffer),
			Attr:  GetString("", xpath.WarningAttr, rawOffer),
		}
		warnings = append(warnings, warning)
	}
	return warnings
}

func GetString(ExpressionPrefix string, XPathExpression string, Node *xmlpath.Node) string {
	Expr := []string{ExpressionPrefix, XPathExpression}
	pathString := strings.Join(Expr, "")
	if pathString != "BLANK" {
		Path, err := xmlpath.Compile(pathString)
		if err == nil {
			Value, _ := Path.String(Node)
			return Value
		}
	}
	return ""
}

I am able to get the ShortText, but not the value.
I checked the error by printing the result of Path, err := xmlpath.Compile(pathString) to the terminal, and the err shown is: compiling xml path "":0: empty path.

Is there any solution for this?
Thanks.
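
The error message in the question points at the cause: an empty string is not a valid path expression. In XPath the node currently being visited is written as ".", so a path of "." is the usual way to ask for an element's own text. Whether xmlpath accepts "." here should be checked against that package's documentation. As a point of comparison, the sketch below uses a different technique, the standard library's encoding/xml, to read both the attribute and the text of each Warning element. The struct names and the parseWarnings helper are hypothetical and are not part of the code in the question.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Warning holds one <Warning> element: its ShortText attribute and its text.
type Warning struct {
	Attr  string `xml:"ShortText,attr"`
	Value string `xml:",chardata"`
}

// warningList maps the <Warnings> root element and its children.
type warningList struct {
	Items []Warning `xml:"Warning"`
}

// parseWarnings (hypothetical helper) decodes the document with encoding/xml
// and returns one Warning per <Warning> element.
func parseWarnings(doc string) ([]Warning, error) {
	var list warningList
	if err := xml.Unmarshal([]byte(doc), &list); err != nil {
		return nil, err
	}
	return list.Items, nil
}

// sample is a shortened copy of the XML from the question.
const sample = `<Warnings>
  <Warning Type="309" ShortText="Unfulfilled Paid Service 1">Unable to book seat 1</Warning>
  <Warning Type="309" ShortText="Unfulfilled Paid Service 2">Unable to book seat 2</Warning>
</Warnings>`

func main() {
	ws, err := parseWarnings(sample)
	if err != nil {
		panic(err)
	}
	for _, w := range ws {
		fmt.Printf("%s: %s\n", w.Attr, w.Value)
	}
}
```

This approach fits when the document structure is known in advance. XPath remains the more flexible choice when the paths are supplied as configuration, as they are in the question.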