# 网络爬虫

这里存储一些爬虫需要注意的点

### 爬虫不能忘记的几个点

* 标签查找要使用稳定的方法
* `selenium`设置隐式等待优于`sleep`

### Cookie

将cookie设置另做函数是一种推荐的做法

```python
def set_headers(cookieFileName):
    """
    设置请求头，主要是cookie的设置，需要登录企查查后，进入开发者工作获取。
    :param cookieFileName: 存储cookie的文件例如：'cookie.txt',str
    :return: 请求头header,dict
    """
    with open(cookieFileName, 'rt', encoding='utf-8') as f:
        cookie = f.read().strip()
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Cookie': cookie
    }
    print('headers设置成功')
    return header
```

### 爬外网时需要注意自己的梯子

* `requests`设置代理

  ```python
  import requests
  #ssr配置代理，端口查看梯子软件
  proxies={'http': 'http://127.0.0.1:10619', 'https': 'http://127.0.0.1:10619'}
  response = requests.get(Url,proxies=proxies)
  html = response.content.decode('utf-8', 'ignore')
  soup = BeautifulSoup(html, features="lxml")
  ```
* `selenuim`设置代理

  ```python
  from selenium.webdriver import Chrome, ChromeOptions # 导入类库
  option = ChromeOptions() # 初始化类
  ip = "127.0.0.1"
  port = "10619"
  # 设置代理
  option.add_argument("--proxy-server=http://{}:{}".format(ip, port))
  driver = Chrome(options=option)  # 模拟开浏览器
  driver.get(Url) # 跳转网址
  ```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://euclid-jie.gitbook.io/euclid_jie_s_book/pa-chong-xiang-guan.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
