In this article, we will learn how to scrape web data with Rust. The complete code is included at the end of the article.
cargo new rust_tutorial
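Next, add the libraries used in this tutorial to Cargo.toml: reqwest for HTTP requests, scraper for HTML parsing, and tokio as the async runtime that drives the request.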
[dependencies]
reqwest = "0.11" # the 0.11 series is built on tokio 1.x; 0.10 targets tokio 0.2
scraper = "0.12.0"
tokio = { version = "1.25.0", features = ["full"] }
The prices are stored in elements with the class price_color. Now let's write the Rust code and extract the data.
use reqwest::Client;
use scraper::{Html, Selector};
With reqwest we open an HTTP connection to the host website, and with scraper we parse the HTML content. First, we send a GET request through reqwest.
let client = Client::new();
let res = client.get("http://books.toscrape.com/").send().await?;
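Here we bind the values to variables with let. Neither binding uses the mut modifier, so both are immutable. Immutable bindings make the code easier to read: if a value could change later, you would have to check every other part of the code that depends on it.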
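To make the difference concrete, here is a minimal sketch that uses only the standard library. The helper count_nonempty_lines and the sample text are made up for illustration and are not part of the scraper.

```rust
// Hypothetical helper (not part of the scraper): counts the non-empty lines in `text`.
fn count_nonempty_lines(text: &str) -> usize {
    // `mut` is needed because the counter is updated inside the loop.
    let mut count = 0;
    for line in text.lines() {
        if !line.trim().is_empty() {
            count += 1;
        }
    }
    count
}

fn main() {
    // Immutable binding: a later `sample = ...;` would be rejected by the compiler.
    let sample = "Olio\n\nSoumission\n";
    println!("{}", count_nonempty_lines(sample));
}
```

Only count needs mut here. Everything else can stay immutable, which is the default the tutorial code relies on.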
let body = res.text().await?;
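Here res.text().await? returns the response body as an HTML string, which we store in the body variable. If reading the body fails, the ? operator returns the error from the function early.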
let document = Html::parse_document(&body);
This document object can now be used to select elements and navigate to the ones we need.
let book_title_selector = Selector::parse("h3 > a").unwrap();
As the figure above shows, the target a tag is a child of the h3 tag. That is why we used h3 > a as the selector in the code above.
for book_title in document.select(&book_title_selector) {
    let title = book_title.text().collect::<Vec<_>>();
    println!("Title: {}", title[0]);
}
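The select method gives us an iterator over the elements that match book_title_selector. We loop over them, collect each element's text, and print the first piece. Note that this is the link text, which the page truncates for long titles. The loop does not read the element's title attribute.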
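In the collect call inside the loop, Vec<_> denotes a vector, that is, a dynamically sized array. The _ asks the compiler to infer the element type, which is &str here. You can access any element of a vector by its position, so title[0] is the first piece of text.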
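Indexing with [0] panics when the vector is empty, for example when an element contains no text. The following standard-library sketch shows two alternatives: first(), which returns an Option, and collecting the pieces straight into a String. The helper join_text and the sample fragments are assumptions made for illustration.

```rust
// Hypothetical helper: joins text fragments into one trimmed String.
fn join_text<'a>(parts: impl Iterator<Item = &'a str>) -> String {
    parts.collect::<String>().trim().to_string()
}

fn main() {
    let fragments = vec!["  Sharp ", "Objects  "];

    // Same collect as in the tutorial, with the element type spelled out instead of `_`.
    let collected: Vec<&str> = fragments.iter().copied().collect();

    // `first()` returns None for an empty vector, whereas `collected[0]` would panic.
    match collected.first() {
        Some(text) => println!("First fragment: {}", text.trim()),
        None => println!("No text found"),
    }

    println!("Joined: {}", join_text(collected.into_iter()));
}
```

In the scraper, the same idea would apply to book_title.text(), which yields &str fragments.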
let book_price_selector = Selector::parse(".price_color").unwrap();
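We again use the Selector::parse function to create a scraper::Selector object. As mentioned above, the prices are stored under the price_color class, so we pass .price_color to parse as the CSS selector.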
for book_price in document.select(&book_price_selector) {
    let price = book_price.text().collect::<Vec<_>>();
    println!("Price: {}", price[0]);
}
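For each element that matches the selector, the loop takes its text and prints it to the console. Running the program produces the following output: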
Title: A Light in the ...
Title: Tipping the Velvet
Title: Soumission
Title: Sharp Objects
Title: Sapiens: A Brief History ...
Title: The Requiem Red
Title: The Dirty Little Secrets ...
Title: The Coming Woman: A ...
Title: The Boys in the ...
Title: The Black Maria
Title: Starving Hearts (Triangular Trade ...
Title: Shakespeare's Sonnets
Title: Set Me Free
Title: Scott Pilgrim's Precious Little ...
Title: Rip it Up and ...
Title: Our Band Could Be ...
Title: Olio
Title: Mesaerion: The Best Science ...
Title: Libertarianism for Beginners
Title: It's Only the Himalayas
Price: £51.77
Price: £53.74
Price: £50.10
Price: £47.82
Price: £54.23
Price: £22.65
Price: £33.34
Price: £17.93
Price: £22.60
Price: £52.15
Price: £13.99
Price: £20.66
Price: £17.46
Price: £52.29
Price: £35.02
Price: £57.25
Price: £23.88
Price: £37.59
Price: £51.33
Price: £45.17
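The prices in the output are plain strings such as £51.77. To sort or sum them, they first have to be converted to numbers. The sketch below is one way to do that with the standard library only. The helper parse_price is hypothetical, and it assumes that a price is a single decimal number preceded by a currency symbol.

```rust
// Hypothetical helper: converts text such as "£51.77" into a number.
fn parse_price(text: &str) -> Option<f64> {
    text.trim()
        // Drop everything before the first digit (currency symbol, stray bytes).
        .trim_start_matches(|c: char| !c.is_ascii_digit())
        .parse::<f64>()
        .ok()
}

fn main() {
    for raw in ["£51.77", "£53.74", "n/a"] {
        match parse_price(raw) {
            Some(price) => println!("{} -> {:.2}", raw, price),
            None => println!("{} -> not a price", raw),
        }
    }
}
```

A floating-point value is enough for display and rough comparisons. For exact money arithmetic, storing the amount as an integer number of pence would be the safer choice.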
Complete code:
use reqwest::Client;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a new client
    let client = Client::new();

    // Send a GET request to the website
    let res = client.get("http://books.toscrape.com/")
        .send().await?;

    // Extract the HTML from the response
    let body = res.text().await?;

    // Parse the HTML into a document
    let document = Html::parse_document(&body);

    // Create a selector for the book titles
    let book_title_selector = Selector::parse("h3 > a").unwrap();

    // Iterate over the book titles
    for book_title in document.select(&book_title_selector) {
        let title = book_title.text().collect::<Vec<_>>();
        println!("Title: {}", title[0]);
    }

    // Create a selector for the book prices
    let book_price_selector = Selector::parse(".price_color").unwrap();

    // Iterate over the book prices
    for book_price in document.select(&book_price_selector) {
        let price = book_price.text().collect::<Vec<_>>();
        println!("Price: {}", price[0]);
    }

    Ok(())
}
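The program prints every title before any price, so the reader has to match them up by position. A possible refinement is to pair the two lists. The sketch below does this with zip on plain string slices. The helper pair_books is hypothetical, and it assumes that both selectors return their matches in the same page order and in equal numbers.

```rust
// Hypothetical helper: pairs the n-th title with the n-th price.
fn pair_books(titles: &[&str], prices: &[&str]) -> Vec<String> {
    titles
        .iter()
        // `zip` stops at the shorter input, so an unmatched item is silently dropped.
        .zip(prices.iter())
        .map(|(title, price)| format!("{} | {}", title, price))
        .collect()
}

fn main() {
    // Sample values copied from the output shown earlier.
    let titles = ["A Light in the ...", "Tipping the Velvet", "Soumission"];
    let prices = ["£51.77", "£53.74", "£50.10"];
    for line in pair_books(&titles, &prices) {
        println!("{}", line);
    }
}
```

In the scraper itself, the two slices could be filled by collecting the text of each select loop into a vector before printing.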