Content Scraping Principle
The src/contents/scraper.ts file defines the scraper logic used to get webpage content when publishing articles.
Similarly, we listen for messages from the Options page. When a user clicks the Get Content button in the Article tab, the Options page sends this message, and the listener calls the scrapeContent function to get the webpage content.
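The listener described above can be sketched as follows. This is only an illustration: the SCRAPE_CONTENT action name, the message shape, and the ScrapedContent type are assumptions, and scrapeContent is replaced by a placeholder. The only real API used is chrome.runtime.onMessage.

```typescript
// Hypothetical message shape; the real action name and payload are not given in the text.
interface ScrapeRequest {
  action: "SCRAPE_CONTENT";
}

// Hypothetical result type for illustration.
interface ScrapedContent {
  title: string;
  content: string;
}

// Placeholder standing in for the scrapeContent function described above.
async function scrapeContent(): Promise<ScrapedContent> {
  return { title: "placeholder title", content: "placeholder content" };
}

// Returns true when the message is recognised, which tells the browser to keep
// the message channel open until sendResponse is called asynchronously.
function handleMessage(
  message: unknown,
  sendResponse: (response: ScrapedContent | { error: string }) => void,
): boolean {
  const request = message as Partial<ScrapeRequest> | null;
  if (!request || request.action !== "SCRAPE_CONTENT") {
    return false;
  }
  scrapeContent()
    .then(sendResponse)
    .catch((err: unknown) => sendResponse({ error: String(err) }));
  return true;
}

// Register the listener only when running inside an extension context.
declare const chrome: any;
if (typeof chrome !== "undefined" && chrome.runtime?.onMessage) {
  chrome.runtime.onMessage.addListener(
    (message: unknown, _sender: unknown, sendResponse: (r: unknown) => void) =>
      handleMessage(message, sendResponse),
  );
}
```

Keeping the handler separate from the registration makes it possible to exercise the logic outside the browser.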
By default, we use the defaultScraper function to get webpage content, and the scraper function is selected based on the webpage URL. For example, pages under https://blog.csdn.net/ use the scrapeCSDNContent function instead.
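One way to express this URL-based selection is a lookup keyed by hostname with defaultScraper as the fallback. The sketch below is an assumption about the shape of that logic, not the project's actual code: pickScraper, scrapersByHost, and the ScrapedContent type are hypothetical names, and both scrapers are placeholders.

```typescript
// Hypothetical result type for illustration.
interface ScrapedContent {
  title: string;
  content: string;
}

type Scraper = () => Promise<ScrapedContent>;

// Placeholder for the generic scraper.
async function defaultScraper(): Promise<ScrapedContent> {
  return { title: "default", content: "" };
}

// Placeholder for the CSDN-specific scraper.
async function scrapeCSDNContent(): Promise<ScrapedContent> {
  return { title: "csdn", content: "" };
}

// Hypothetical registry mapping a hostname to its site-specific scraper.
const scrapersByHost = new Map<string, Scraper>([
  ["blog.csdn.net", scrapeCSDNContent],
]);

// Picks the scraper registered for the URL's hostname,
// falling back to defaultScraper for unknown hosts or invalid URLs.
function pickScraper(url: string): Scraper {
  let hostname: string;
  try {
    hostname = new URL(url).hostname;
  } catch {
    return defaultScraper;
  }
  return scrapersByHost.get(hostname) ?? defaultScraper;
}
```

With this layout, supporting another site would only require adding one entry to the map.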
Taking CSDN as an example, the scrapeCSDNContent function works in three steps: it uses the Readability library to extract the main content of the webpage, runs the preprocessor function over that content, and finally uses site-specific selectors to get the article title, author, cover, content, summary, and other information. The selectors differ from site to site because each type of website structures its pages differently.
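The three steps can be sketched as below. Everything specific here is an assumption made for illustration: the CSS selectors, the ArticleData fields, and the body of preprocessor are hypothetical, and the Readability step is passed in as a function so the sketch has no dependencies.

```typescript
// Subset of the object returned by Readability's parse().
interface ReadableArticle {
  title?: string | null;
  byline?: string | null;
  content?: string | null;
  excerpt?: string | null;
}

// Minimal view of a DOM root: just enough to run selectors.
interface QueryRoot {
  querySelector(selector: string): {
    textContent: string | null;
    getAttribute(name: string): string | null;
  } | null;
}

// Hypothetical output shape.
interface ArticleData {
  title: string;
  author: string;
  cover: string;
  content: string;
  summary: string;
}

// Hypothetical preprocessor: removes <script> and <style> blocks and trims whitespace.
function preprocessor(html: string): string {
  return html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "").trim();
}

// Sketch of a CSDN scraper; the selectors are guesses, not verified against the site.
function scrapeCSDNContent(
  root: QueryRoot,
  parse: () => ReadableArticle | null,
): ArticleData {
  // Step 1: Readability extraction (may return null when parsing fails).
  const article = parse();
  const text = (selector: string): string =>
    root.querySelector(selector)?.textContent?.trim() || "";
  return {
    // Step 3: site-specific selectors first, Readability fields as fallback.
    title: text("h1.title-article") || article?.title || "",
    author: text(".follow-nickName") || article?.byline || "",
    cover: root.querySelector("#content_views img")?.getAttribute("src") ?? "",
    // Step 2: clean the extracted HTML.
    content: preprocessor(article?.content ?? ""),
    summary: article?.excerpt ?? "",
  };
}
```

In the extension, the parse argument would presumably wrap the real library, for example `() => new Readability(document.cloneNode(true) as Document).parse()` with Readability imported from @mozilla/readability. The document is cloned because Readability modifies the DOM it is given.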