The New York Times Annotated Corpus
- Last Update: October 27, 2015 Created: October 27, 2015
Public
Profile
Title of the dataset | The New York Times Annotated Corpus |
---|---|
Provenance of the dataset | https://catalog.ldc.upenn.edu/LDC2008T19 |
How were the data collected/created? What was the cost? | Compiled from archived New York Times articles. Tagging and similar annotation is partly automatic, but some of it is done by hand. |
Data sharing policy | Other |
About data analysis and simulation
Type of data: Check all that apply. Use "Other" to specify other types so that we can include them in further updates. | graph text number series |
---|---|
Variable labels of dataset (the names of the variables) | ARTICLE_ABSTRACT, ONLINE_LEAD_PARAGRAPH, TAXONOMIC_CLASSIFIERS, SECTION, ALTERNATE_URL, ORGANIZATIONS, NORMALIZED_BYLINE, PUBLICATION_DAY_OF_MONTH, ONLINE_LOCATIONS, NAMES, BYLINE, WORD_COUNT, TYPES_OF_MATERIAL, COLUMN_NAME, FEATUREPAGE, URL, DATELINE, HEADLINE, COLUMN_NUMBER, ONLINE_SECTION, PEOPLE, PUBLICATION_MONTH, ONLINE_HEADLINE, LEAD_PARAGRAPH, NEWS_DESK, BANNER(ADDITIONAL_INFORMATION_APPENDED_TO_THE_ARTICLES), PAGE, SLUG, TITLES, ONLINE_PEOPLE, ONLINE_TITLES, BODY(THE_TEXT_CONTENT_OF_THE_ARTICLE), SERIES_NAME, DAY_OF_WEEK, AUTHOR_BIOGRAPHY, CORRECTION_TEXT, ONLINE_DESCRIPTORS, PUBLICATION_DATE, CORRECTION_DATE, DESCRIPTORS, ONLINE_ORGANIZATIONS, GENERAL_ONLINE_DESCRIPTORS, LOCATIONS, PUBLICATION_YEAR, GUID, CREDIT, BIOGRAPHICAL_CATEGORIES(HAND-ASSIGNED_TAG), KICKER |
Outline of data | An archive of English-language news articles from The New York Times. The New York Times Annotated Corpus provides about 1.8 million New York Times articles published between January 1, 1987 and June 19, 2007, together with their metadata. The license fee is 300 USD. The corpus contains: over 1.8 million articles (excluding wire services articles that appeared during the covered period); over 650,000 article summaries written by library scientists; over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors; over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com; and Java tools for parsing corpus documents from .xml into a memory resident object. Details of the variables are documented in the manual (https://catalog.ldc.upenn.edu/docs/LDC2008T19/new_york_times_annotated_corpus.pdf). |
Simulation process | Summarization, metadata extraction, information retrieval, information extraction |
Expected outcome of the process (obtained knowledge, analysis results, output of tools) | |
Anticipation for analyses/simulations other than the typical ones provided above |
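
The Outline row mentions Java tools that parse corpus documents from .xml into an in-memory object. As a rough illustration of that kind of metadata extraction, here is a minimal Python sketch. The sample document, its element names and its `meta` attribute names are hypothetical (the names loosely follow the variable labels listed above) and are not taken from the corpus itself; check the manual for the actual layout. `parse_article` is likewise an illustrative helper, not part of the distributed tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The real corpus layout must be checked
# against the LDC manual; none of these element names are guaranteed.
SAMPLE_XML = """<nitf>
  <head>
    <title>Example Headline</title>
    <meta name="publication_year" content="1987"/>
    <meta name="publication_month" content="1"/>
    <meta name="publication_day_of_month" content="1"/>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Turn one article document into a plain dictionary."""
    root = ET.fromstring(xml_text)

    # Collect every <meta name="..." content="..."/> pair into a dict.
    meta = {m.get("name"): m.get("content") for m in root.iter("meta")}

    # Join the paragraphs of the (assumed) full-text block.
    paragraphs = root.findall(".//block[@class='full_text']/p")
    body = "\n".join(p.text or "" for p in paragraphs)

    return {
        "headline": root.findtext("head/title"),
        "publication_date": (
            int(meta["publication_year"]),
            int(meta["publication_month"]),
            int(meta["publication_day_of_month"]),
        ),
        "body": body,
    }


article = parse_article(SAMPLE_XML)
print(article["headline"], article["publication_date"])
```

Returning a plain dictionary keeps the sketch independent of any particular downstream task (summarization, retrieval, extraction); a real loader would map more of the variable labels and handle missing fields.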
Other
Comments | Reference: http://qiita.com/yubessy/items/58f5a1c6749a65ba0995 According to this reference site, the data appears to be as follows. Number of articles: 1,855,658. Article IDs: 0000000 - 1855670. Missing article IDs: 48372, 51952, 69594, 81513, 113822, 288553, 858493, 858494, 858495, 858496, 858498, 858499, 1685651. The user manual is available at https://catalog.ldc.upenn.edu/docs/LDC2008T19/new_york_times_annotated_corpus.pdf |
---|---|
What kind of data/tools do you wish to have? | |
Visualized information | |
Sample data |
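
The figures quoted in the Comments row can be cross-checked against each other: an inclusive ID range minus the listed missing IDs should equal the reported article count. The short sketch below does that arithmetic. The function name `expected_article_count` is made up for illustration, and the numbers are copied from the Comments row rather than verified against the corpus.

```python
# Values as reported in the Comments row (not independently verified).
FIRST_ID = 0
LAST_ID = 1855670
REPORTED_COUNT = 1855658
MISSING_IDS = {
    48372, 51952, 69594, 81513, 113822, 288553,
    858493, 858494, 858495, 858496, 858498, 858499,
    1685651,
}


def expected_article_count(first_id, last_id, missing_ids):
    """Count IDs in the inclusive range that are not listed as missing."""
    missing_in_range = {i for i in missing_ids if first_id <= i <= last_id}
    return (last_id - first_id + 1) - len(missing_in_range)


count = expected_article_count(FIRST_ID, LAST_ID, MISSING_IDS)
print(count, count == REPORTED_COUNT)
```

The range 0 to 1855670 holds 1,855,671 IDs; removing the 13 missing ones gives 1,855,658, which agrees with the reported number of articles.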