The New York Times Annotated Corpus

Last Update: October 27, 2015  Created: October 27, 2015

Public

Profile

Title of the dataset	The New York Times Annotated Corpus
Provenance of the dataset	https://catalog.ldc.upenn.edu/LDC2008T19
How were the data collected/created? What was the cost?	Compiled from The New York Times' archive of past articles. Some of the tagging is automatic, but some is done by hand.
Data sharing policy	Other

About data analysis and simulation

Type of data: Check all that apply. Use "Other" to specify other types so that we can include them in further updates.	graph text number series
Variable labels of dataset (the names of the variables)	ARTICLE_ABSTRACT|ONLINE_LEAD_PARAGRAPH|TAXONOMIC_CLASSIFIERS|SECTION|ALTERNATE_URL|ORGANIZATIONS|NORMALIZED_BYLINE|PUBLICATION_DAY_OF_MONTH|ONLINE_LOCATIONS|NAMES|BYLINE|WORD_COUNT|TYPES_OF_MATERIAL|COLUMN_NAME|FEATUREPAGE|URL|DATELINE|HEADLINE|COLUMN_NUMBER|ONLINE_SECTION|PEOPLE|PUBLICATION_MONTH|ONLINE_HEADLINE|LEAD_PARAGRAPH|NEWS_DESK|BANNER(ADDITIONAL_INFORMATION_APPENDED_TO_THE_ARTICLES)|PAGE|SLUG|TITLES|ONLINE_PEOPLE|ONLINE_TITLES|BODY(THE_TEXT_CONTENT_OF_THE_ARTICLE)|SERIES_NAME|DAY_OF_WEEK|AUTHOR_BIOGRAPHY|CORRECTION_TEXT|ONLINE_DESCRIPTORS|PUBLICATION_DATE|CORRECTION_DATE|DESCRIPTORS|ONLINE_ORGANIZATIONS|GENERAL_ONLINE_DESCRIPTORS|LOCATIONS|PUBLICATION_YEAR|GUID|CREDIT|BIOGRAPHICAL_CATEGORIES(HAND-ASSIGNED_TAG)|KICKER
Outline of data	An archive of English-language news articles from The New York Times. The New York Times Annotated Corpus provides about 1.8 million New York Times articles published between January 1, 1987 and June 19, 2007, together with their metadata. The corpus costs 300 USD to use. It contains the following: ・Over 1.8 million articles (excluding wire services articles that appeared during the covered period). ・Over 650,000 article summaries written by library scientists. ・Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors. ・Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com. ・Java tools for parsing corpus documents from .xml into a memory resident object. Details of the variables are given in the manual (https://catalog.ldc.upenn.edu/docs/LDC2008T19/new_york_times_annotated_corpus.pdf).
Simulation process	summarization, metadata extraction, information retrieval, information extraction
Expected outcome of the process (obtained knowledge, analysis results, output of tools)
Anticipation for analyses/simulations other than the typical ones provided above
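The outline above mentions tools that parse corpus documents from .xml into a memory-resident object. The following is a minimal Python sketch of that idea, not the bundled Java tools. The sample document, its element and attribute names, the `Article` class and the `parse_article` function are all assumptions made for illustration; the actual document layout is defined in the manual and should be checked there.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical, heavily simplified document. Element and attribute names
# here are assumptions; consult the corpus manual for the real layout.
SAMPLE_XML = """<nitf>
  <head>
    <title>Example headline</title>
    <meta name="publication_year" content="1987"/>
    <meta name="publication_month" content="1"/>
    <meta name="publication_day_of_month" content="1"/>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


@dataclass
class Article:
    """In-memory view of one document (illustrative subset of fields)."""
    headline: str
    meta: dict = field(default_factory=dict)
    body: str = ""


def parse_article(xml_text: str) -> Article:
    """Parse one XML document string into an Article object."""
    root = ET.fromstring(xml_text)
    headline = root.findtext("head/title", default="")
    # Collect every <meta name="..." content="..."/> pair into a dict.
    meta = {m.get("name"): m.get("content") for m in root.iter("meta")}
    # Join the paragraphs of the (assumed) full-text block.
    paragraphs = [p.text or "" for p in root.findall(".//block[@class='full_text']/p")]
    return Article(headline=headline, meta=meta, body="\n".join(paragraphs))


article = parse_article(SAMPLE_XML)
print(article.headline)
print(article.meta["publication_year"])
```

A real loader would read each file from disk and map more of the variable labels listed above onto fields; this sketch covers only a headline, the `meta` pairs and the body text.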

Other

Comments	http://qiita.com/yubessy/items/58f5a1c6749a65ba0995 According to the reference site above, the data appear to be in the following state. Number of articles: 1,855,658. Article IDs: 0000000 - 1855670. Missing article IDs: 48372, 51952, 69594, 81513, 113822, 288553, 858493, 858494, 858495, 858496, 858498, 858499, 1685651. The user manual is available at https://catalog.ldc.upenn.edu/docs/LDC2008T19/new_york_times_annotated_corpus.pdf
What kind of data/tools do you wish to have?
Visualized information
Sample data
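The figures quoted in the Comments field can be cross-checked with simple arithmetic: the inclusive ID range minus the listed missing IDs should equal the reported article count. The short Python sketch below does this. The numbers are copied from the comment, while the function and constant names are hypothetical.

```python
# Figures quoted in the Comments field (taken from the referenced site).
FIRST_ID = 0
LAST_ID = 1855670
MISSING_IDS = [48372, 51952, 69594, 81513, 113822, 288553,
               858493, 858494, 858495, 858496, 858498, 858499, 1685651]
REPORTED_ARTICLE_COUNT = 1855658


def expected_article_count(first_id, last_id, missing_ids):
    """Size of the inclusive range [first_id, last_id] minus missing IDs in it."""
    in_range = {i for i in missing_ids if first_id <= i <= last_id}
    return (last_id - first_id + 1) - len(in_range)


def is_missing(article_id):
    """True if the ID is listed as absent from the corpus."""
    return article_id in set(MISSING_IDS)


count = expected_article_count(FIRST_ID, LAST_ID, MISSING_IDS)
print(count, count == REPORTED_ARTICLE_COUNT)  # 1855658 True
```

The range holds 1,855,671 IDs and 13 are listed as missing, which gives 1,855,658 and matches the reported count. A helper such as `is_missing` could be used to skip the absent IDs when iterating over the corpus by ID.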
