1. Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, Biberstine JR, Schijvenaars B, Skupin A, Ma N, Börner K: Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS One; 2011 Mar 17;6(3):e18029

  • [Source] The source of this record is MEDLINE®, a database of the U.S. National Library of Medicine.
  • [Title] Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches.
  • BACKGROUND: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents.
  • Clustering large sets of text documents is important for a variety of information needs and applications, such as collection management and navigation, summarization, and analysis.
  • The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results.
  • Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.
  • METHODOLOGY: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings.
  • The nine approaches combined five different analytical techniques with two data sources.
  • The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models--BM25 and PMRA (PubMed Related Articles).
  • The two data sources were a) MeSH subject headings, and b) words from titles and abstracts.
  • Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering.
  • Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
  • CONCLUSIONS: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts.
  • Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
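
The methodology above names three building blocks that a small example can make concrete: tf-idf cosine similarity, top-n filtering of each document's similarities, and the Jensen-Shannon divergence behind the coherence measure. The following is a minimal sketch in plain Python, not the authors' implementation. The function names, the toy documents, the unsmoothed `log(N/df)` idf weighting, and the base-2 logarithm are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumption: raw term counts times log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, n):
    """For each document index, keep only its n most similar other documents."""
    result = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by lower document index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in scored[:n]]
    return result


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two term-probability dicts."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy corpus: three already-tokenized "documents".
docs = [
    ["gene", "expression", "cancer"],
    ["gene", "expression", "tumor"],
    ["graph", "layout", "cluster"],
]
neighbors = top_n_similar(tfidf_vectors(docs), n=1)
```

In this toy corpus the first two documents share terms, so each is the other's single retained neighbor, while the third document has no overlap with either. Computing all pairwise similarities as done here is quadratic in the number of documents and is only suitable for illustration; a corpus of two million records would need an inverted index or a sparse-matrix library.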

  • [PMID] 21437291
  • [ISSN] 1932-6203
  • [Journal-full-title] PloS one
  • [ISO-abbreviation] PLoS ONE
  • [Language] ENG
  • [Grant] United States / NHLBI NIH HHS / HL / HHSN268200900053C; United States / PHS HHS / / HHSN268200900053C
  • [Publication-type] Comparative Study; Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't
  • [Publication-country] United States
  • [Other-IDs] NLM/ PMC3060097