Seminar @ NTU
- Parsing: the process of analyzing a string and mapping it onto some data structure (usually a syntax tree)
- Semantic Parsing: the process of converting NL into an MRL
- MRL: Meaning Representation Language
- Training Set: the data set fed into the system during development
- Test Set: in contrast to the training set, the test set is used to evaluate the system's performance after development
- Cross-Validation: evaluating a model on data it was not trained on, rotating which portion of the data serves as the test set
- N-fold Cross-Validation: split the data into N parts; run N evaluations, each time taking one part as the test set, and average the results
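The N-fold procedure above can be sketched in a few lines of Python; the `train` and `evaluate` callables are placeholders for whatever learning system is being measured, not anything from the seminar itself:

```python
# A minimal sketch of N-fold cross-validation. `train` builds a model from
# a training set; `evaluate` scores a model on a test set. Both are
# assumptions supplied by the caller.

def n_fold_cross_validation(data, n, train, evaluate):
    """Split `data` into n folds; each fold serves once as the test set.
    Returns the average evaluation score over the n runs."""
    folds = [data[i::n] for i in range(n)]  # round-robin split into n parts
    scores = []
    for i in range(n):
        test_set = folds[i]
        # training set = everything that is not in fold i
        training_set = [x for j, fold in enumerate(folds) if j != i
                          for x in fold]
        model = train(training_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / n
```

Each data point lands in exactly one test set, so every sample is used for evaluation exactly once, which is the point of the technique.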
More about Semantic Parsing
- Natural Language grammar is not as rigorous as that of a Programming Language, so ambiguity is common
- e.g. "I saw a man with a telescope." has two reasonable interpretations (I used the telescope, or the man was carrying one)
- The larger the Training Data, the higher the precision in that domain, but insufficient or unrepresentative sampling can bias the system
To achieve deep understanding, you have to solve the SGP.
There are 3 types of information needs:
- Informational: Queries about “How”, “What”, etc.
- Transactional: Queries aimed at carrying out a transaction, such as purchases or financial trading
- Navigational: Queries of finding a specific website
What is the major type of information need when it comes to searching in the field of blogs?
Ans. “Informational,” especially queries about opinions.
There are some methods to obtain the blog data:
- XML Feeds: RSS, Atom, and similar formats
- HTML: in comparison with XML, its tags carry less semantic information
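As a sketch of why feeds are the easier source, the standard library alone can pull post titles out of an RSS 2.0 document; the sample feed and function name below are made up for illustration, and real feeds vary (Atom, for instance, uses different element names and namespaces):

```python
# Extract blog post titles from an RSS 2.0 feed using only the stdlib.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

def extract_titles(rss_text):
    """Return the <title> of every <item> in an RSS feed string."""
    root = ET.fromstring(rss_text)
    # iter() walks the whole tree, so channel nesting does not matter
    return [item.findtext("title") for item in root.iter("item")]
```

The same extraction from raw HTML would require guessing which `<div>` or `<a>` holds a title, which is exactly the "less semantic information" problem noted above.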
Some problems when collecting data
- There might be hundreds of thousands of requests within a day
- How to tell whether a blog is a splog (spam blog) or not, and there might be garbage anywhere in the pages
More about “Recall” and “Precision”
If the IR system returns 320 entries that might be about "zoo",
and after checking them through we find that 250 are relevant,
while the corpus contains 400 entries that are actually relevant,
then the Recall will be: 250 / 400 = 0.625,
and the Precision is: 250 / 320 ≈ 0.781
To make it clear,
Recall = (# of correct data returned) / (# of all correct data in corpus)
Precision = (# of correct data returned) / (# of data returned)
Improving one of them usually means trading off the other.
However, the higher both are, the better the system is.
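The two formulas and the zoo example above can be checked with a small helper; the function and parameter names are illustrative, not from the seminar:

```python
def recall_precision(num_returned, num_relevant_returned, num_relevant_in_corpus):
    """Recall    = (# correct data returned) / (# all correct data in corpus)
       Precision = (# correct data returned) / (# data returned)"""
    recall = num_relevant_returned / num_relevant_in_corpus
    precision = num_relevant_returned / num_returned
    return recall, precision

# The zoo example: 320 entries returned, 250 of them relevant,
# 400 relevant entries in the corpus overall.
recall, precision = recall_precision(320, 250, 400)
# recall = 0.625, precision = 0.78125 (≈ 0.781)
```

The trade-off shows up in the denominators: returning more entries can only grow recall's numerator, but it also grows precision's denominator.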