20070802_Notes

A friend told me that notes should be taken properly, so I started this new category.
I hope to keep adding new things to it from now on @@


Seminar @ NTU

Keywords

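  • Parsing: The process of analyzing a string and mapping it onto some data structure (usually syntax trees)
  • Semantic Parsing: The process of converting NL (natural language) into MRL
  • MRL: Meaning Representation Language
  • Training Set: The data set that is fed into the system for training while the system is being developed
  • Test Set: In contrast to the training set, the test set is the data set used to evaluate the system's performance after it has been developed
  • Cross-Validation: ..? (no idea)
  • N-fold Cross-Validation: Split the data into N parts and run N evaluations, each time using one of the parts as the test set, then take the average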
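
To make the N-fold idea concrete for myself, here is a minimal sketch in Python. The names `n_fold_splits`, `cross_validate` and the `evaluate` callback are my own placeholders and not from the seminar. I also assume that the parts not used as the test set serve as the training set, and that the data is a plain list split round-robin.

```python
def n_fold_splits(data, n):
    """Yield (training_set, test_set) pairs, one pair per fold."""
    # Round-robin split so the N parts differ in size by at most one item.
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        test_set = folds[i]
        # Assumption: everything outside the current fold is used for training.
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training_set, test_set


def cross_validate(data, n, evaluate):
    """Run N evaluations and return the average score.

    `evaluate(training_set, test_set)` is a hypothetical callback that
    trains on the first argument and returns a score on the second.
    """
    scores = [evaluate(train, test) for train, test in n_fold_splits(data, n)]
    return sum(scores) / len(scores)
```

In practice the data would probably be shuffled before splitting, but I left that out to keep the sketch short.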

More about Semantic Parsing

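  • The grammar of natural language is not as strict as that of a programming language, so ambiguous cases come up often
    • E.g. “I saw a man with a telescope.” has two reasonable interpretations
  • The larger the training data, the higher the precision within the trained domain, but the system may become biased if the samples are too few or not broad enough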

SGP: Symbol Grounding Problem

To achieve deep understanding, you have to solve the SGP.


There are 3 types of information needs:

  • Informational: Queries that ask “How”, “What”, etc.
  • Transactional: Queries aimed at carrying out a transaction, such as an exchange of information or a financial trade
  • Navigational: Queries for finding a specific website

Which type is the most common when it comes to searching blogs?
Ans. “Informational,” especially queries about opinions.

There are some methods to obtain the blog data:

  • XML Feeds: RSS, Atom, or something similar
  • HTML: compared with XML, its tags carry less semantic information
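
As a rough illustration of why feeds are convenient, here is a sketch that pulls the posts out of an RSS 2.0 document using Python's standard `xml.etree.ElementTree`. The feed content in `SAMPLE_RSS` and the function name `parse_rss_items` are made up for this example. A real crawler would download the feed first and would also need to handle Atom, which uses different tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>More notes</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item>, with its title, link and description."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # The tag names say what each field means, so no guessing is needed.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for post in parse_rss_items(SAMPLE_RSS):
    print(post["title"], post["link"])
```

With plain HTML the same fields would have to be guessed from layout tags such as `div` or `h2`, which differ from blog to blog.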

Some problems when collecting data

  • There might be hundreds of thousands of requests within a day
  • How to tell whether a blog is a splog (spam blog) or not, and there might be garbage anywhere in the pages

More about “Recall” and “Precision”

Suppose the IR system returns 320 entities that it considers to be about zoos,
and after checking them we find that 250 are relevant,
while the corpus contains 400 entities that are actually relevant.

The Recall will be: 250 / 400 = 0.625,
and the Precision is: 250 / 320 = 0.781

To make it clear,
Recall = (# of correct data returned) / (# of all correct data in corpus)
Precision = (# of correct data returned) / (# of data returned)

Improving one of them usually costs you some of the other, so there is a trade-off.
Still, the higher both are, the better the system is.
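
To double-check the arithmetic, the two formulas can be sketched in Python as below. The function name `precision_recall` and the document IDs are invented for this example. Only the counts (320 returned, 250 of them relevant, 400 relevant in the corpus) come from the zoo example above.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # correct data that was actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up IDs that reproduce the counts of the zoo example:
# 400 relevant entities in the corpus, 320 returned, 250 of those relevant.
relevant = set(range(400))
returned = set(range(250)) | set(range(1000, 1070))

precision, recall = precision_recall(returned, relevant)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Returning every entity in the corpus would push recall to 1.0 but drag precision down, which is one way to see the trade-off.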