University of Notre Dame News: A Reading

I have done a bit of analysis -- reading -- against the set of news distributed by the University of Notre Dame, and below is some of what I learned.

Introduction

A lot of the University's news is distributed via the Web, and the root of the news is located at https://news.nd.edu. Late last calendar year I used a program called wget to crawl the news site and cache the result. Upon closer inspection of the cache, I noticed that some of the Web pages were echoed and indexed in a number of auxiliary pages. I deleted the echoes and index pages, and I copied all of the news stories to a single directory. I then applied a tool called the Distant Reader Toolbox against the directory. This resulted in a data set of news stories which I proceeded to analyze.
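One way to spot echoed pages in a cache is to group files by a digest of their content. The following is only a minimal sketch, not the procedure actually used: it assumes an echo is a byte-for-byte copy of another file, which may not hold for the real cache, and the function name find_echoes is hypothetical.

```python
import hashlib
from pathlib import Path


def find_echoes(directory):
    """Group files under directory by the SHA-256 digest of their bytes.

    Returns a list of groups (each a sorted list of paths) whose members are
    byte-for-byte identical. The first path in each group can be kept and
    the rest treated as echoes.
    """
    groups = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path)
    # Only digests shared by two or more files point to echoes.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Pages that repeat a story with different surrounding markup would not be caught this way; they would require comparing extracted text rather than raw bytes.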

There are about 10,000 articles in the data set, which, I believe, span roughly the past two decades. The data set is about 6.1 million words long. By comparison, Moby Dick is about .25 million words long, and the Bible is about .8 million words long, so the news corpus is roughly 24 times the size of the former and more than 7 times the size of the latter.

Now that I have the data, what do I want to know? Well, my research questions are, "What does the news discuss? Who and what is in the news? If I were to enumerate the themes of the news, then what might those themes be, and to what degree are they mentioned compared to the other themes?" In the end, I really want to know, "What are the University's emphases?"

Simple frequencies

One way to address these questions is to observe simple frequencies. For example, illustrations of the most common words, two-word phrases, nouns, pronouns, named-entities, and statistically significant keywords all allude to what is brought to light in the news. If you ignore things like "facebook", "youtube", "twitter", and "grace hall" (which are all a part of the news feeds' boilerplate texts), then things like research, Catholicism, and awards present themselves. Upon closer inspection, a few disciplines present themselves. Medicine, law, religion, and business are good examples. See below:


most common words

most common two-word phrases


nouns

pronouns


persons

organizations


statistically significant keywords

The observations above are akin to a set of descriptive statistics. For more detail, see the automatically generated summary page, complete with more descriptive statistics and rudimentary bibliographics.
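The simplest of the frequencies described in this section, words and two-word phrases, can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the Distant Reader Toolbox's own implementation: the stop word list is a deliberately tiny, hypothetical one, and the two sample sentences are invented for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop word list; a real one is much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "at", "on", "is", "for"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def frequencies(texts, n=1):
    """Count n-word phrases (n=1 for words, n=2 for two-word phrases).

    Phrases are built after stop words are removed, so a phrase may join
    words that were separated by a stop word in the original text.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Invented sample texts, for illustration only.
texts = [
    "Research at Notre Dame on cancer research.",
    "Notre Dame students win research award.",
]
print(frequencies(texts).most_common(3))
print(frequencies(texts, n=2).most_common(3))
```

Adding boilerplate terms such as "facebook" or "twitter" to the stop word list is one way to keep them out of the counts.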

Example stories

Using the statistically significant keywords as query terms -- while ignoring things like "notre", "dame", "university", and "news" -- what sorts of articles are represented by the keywords? Below are some examples, and for the most part, the stories seem to be about people:

Search

Find news articles on your own with this (experimental) search interface:

[search form: a query field, with output available as HTML, CSV, or JSON]

Topic modeling

Another way to connote "aboutness" is to apply topic modeling. Succinctly stated, topic modeling is an unsupervised machine learning process used to enumerate latent themes in a set of texts. Given an integer (T), topic modeling will divide a set of texts into T subsets, and each subset can be likened to a theme or "topic". (By the way, there is no correct value for T. After all, how many things -- T -- is the whole of Shakespeare's writings about? That said, topic modeling can be quite informative.)
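To make the idea concrete, here is a minimal sketch of one common topic modeling technique, latent Dirichlet allocation fitted with a tiny collapsed Gibbs sampler. It is not the implementation used to produce the results in this essay. The function name topic_model, the parameters alpha, beta, and iterations, and the sample documents are all assumptions made for illustration, and the weight computed here (the share of all tokens assigned to a topic) is not necessarily how the Toolbox computes its weights.

```python
import random
from collections import Counter


def topic_model(documents, T, n_features, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit T topics to tokenized documents; return (weight, features) pairs."""
    rng = random.Random(seed)
    V = len({word for doc in documents for word in doc})  # vocabulary size

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * T for _ in documents]
    topic_word = [Counter() for _ in range(T)]
    topic_total = [0] * T

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(documents):
        row = []
        for word in doc:
            t = rng.randrange(T)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word] += 1
            topic_total[t] += 1
        assignments.append(row)

    for _ in range(iterations):
        for d, doc in enumerate(documents):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][word] -= 1
                topic_total[t] -= 1
                # Resample its topic from the collapsed LDA conditional.
                scores = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(T)
                ]
                t = rng.choices(range(T), weights=scores)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][word] += 1
                topic_total[t] += 1

    total = sum(topic_total)
    topics = []
    for k in range(T):
        features = [w for w, c in topic_word[k].most_common(n_features) if c > 0]
        topics.append((topic_total[k] / total, features))
    return sorted(topics, key=lambda pair: pair[0], reverse=True)


# Invented, already-tokenized sample documents, for illustration only.
documents = [
    ["research", "study", "science", "research"],
    ["students", "program", "school", "research"],
    ["research", "cancer", "study"],
]
for weight, features in topic_model(documents, T=2, n_features=3):
    print(round(weight, 3), " ".join(features))
```

With T set to 1, every token falls into the same topic, so the single feature returned is simply the most frequent word in the (stop-word-free) corpus.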

If I set T equal to 1, then I can answer the question, "If there were one word used to describe the aboutness of the news corpus, then what would that word be?" After applying topic modeling, the answer is "research".

If I set T equal to 4 and ask for 4 words (features) to elaborate upon each of the resulting four topics, then the result looks like this:

      topics  weights                                     features
    students  0.27181           students program school president 
   professor  0.22178  professor catholic institute international 
      people  0.15017                      people time years life 
    research  0.13037            research study professor science 

The story becomes even more interesting when T equals 16:

           labels  weights                                           features
         students  0.16036  students program education school programs sch...
           people  0.13684      people time life years world good think know 
        president  0.12150  president degree served board years vice gradu...
          lecture  0.10432  lecture conference author published professor ...
            south  0.08750  south community bend campus students local bui...
        professor  0.08170  professor book history american published stud...
         research  0.07842  research study shows data published social col...
            award  0.07240  award students graduate student awards researc...
              law  0.07149  law political professor court rights war justi...
         catholic  0.07021  catholic church holy faith life cross pope jen...
      engineering  0.06900  research engineering science professor faculty...
    international  0.06491  international institute peace global studies r...
             film  0.06018  film arts music art author published performin...
         football  0.04996  football jenkins athletics stadium irish game ...
           cancer  0.04220  research cancer health researchers study disea...
          physics  0.02319  physics stars research space star nuclear team...

The following pie chart illustrates the weights of each theme compared to the others. Notice how no single theme dominates:
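The shares behind such a chart can be recomputed from the weights column of the sixteen-topic table above. In the sketch below, the labels and weights are copied from that table; the function name shares is hypothetical, and dividing each weight by the sum of all weights is an assumption about how the slices are sized.

```python
# Labels and weights copied from the sixteen-topic table above.
weights = {
    "students": 0.16036, "people": 0.13684, "president": 0.12150,
    "lecture": 0.10432, "south": 0.08750, "professor": 0.08170,
    "research": 0.07842, "award": 0.07240, "law": 0.07149,
    "catholic": 0.07021, "engineering": 0.06900, "international": 0.06491,
    "film": 0.06018, "football": 0.04996, "cancer": 0.04220,
    "physics": 0.02319,
}


def shares(weights):
    """Divide each weight by the sum of all weights so the shares total 1."""
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# List the themes from largest share to smallest.
for label, share in sorted(shares(weights).items(), key=lambda item: item[1], reverse=True):
    print(f"{label:>15}  {share:6.1%}")
```

Computed this way, the largest slice, students, comes to roughly 12 percent of the whole, which is consistent with the observation that no single theme dominates.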

Summary

In an effort to learn about the emphases of the University, I used computer technology to read the news. The results (above) are merely bunches o' observations. Interpretations of the observations ought to be accepted with more than a grain of salt. That said, if the news is representative of the University's emphases, then I assert the emphases are people and research, or more specifically: 1) people practicing Catholicism, and 2) research primarily in the areas of science, technology, engineering, and medicine.

This whole process could be improved in a number of ways. First of all, date values could be associated with each news story, and if they were, then trends could be observed over time, but alas, date values are not explicitly represented in the content's metadata. Second, much of the boilerplate content -- headers and footers -- could be removed before processing. This would make the output cleaner, but I doubt the result would change very much.

Fun with data science -- data science with words.


Eric Lease Morgan <[email protected]>
Navari Family Center for Digital Scholarship
University of Notre Dame

February 22, 2023