I was sitting at a client site the other day, watching our production logs scroll by, when the client's boss came by:
"You're just sitting there!"
"I'm watching the logs go by."
"Yeah, just sitting there!"
Not exactly. I'm learning what the system looks like in a normal state. Understanding what's normal is the first step to figuring out what's wrong... when there's something wrong. For example, I can only conclude that the frobbler crashed because the widget didn't frobble if I know that the widget normally frobbles, and specifically that the frobble shows up in the log. If I didn't know that line was usually there, I wouldn't notice its absence.
To take another example, suppose I'm watching normal logs and notice that third-party API calls are taking about 3-4 seconds. Nothing in the logs flags that as an error; there are just the usual timestamps and info messages. It might still be a problem - maybe those API calls should be taking 1-2 seconds - even though the system is behaving "normally".
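Once you've watched long enough to know what normal looks like, you can hand part of the comparison to a script. Here is a minimal sketch of the two examples above. It assumes a made-up log format of `<timestamp> <LEVEL> <message>` and a made-up `api call took 3.2s` message; the function names `summarize` and `compare` are mine, not from any real tool, so adjust all of it to whatever your logs actually look like.

```python
import re
from statistics import median

# Hypothetical message, assumed for illustration: "api call took 3.2s"
LATENCY = re.compile(r"api call took (\d+(?:\.\d+)?)s")

def summarize(lines):
    """Return (set of message shapes, list of API latencies in seconds)."""
    shapes, latencies = set(), []
    for line in lines:
        # Assumed layout: "<timestamp> <LEVEL> <message>"
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that don't fit the assumed layout
        message = parts[2].strip()
        match = LATENCY.search(message)
        if match:
            latencies.append(float(match.group(1)))
        # Replace numbers so "took 3.2s" and "took 3.9s" count as one shape.
        shapes.add(re.sub(r"\d+(?:\.\d+)?", "N", message))
    return shapes, latencies

def compare(baseline_lines, current_lines):
    """Report messages that normally appear but are absent now, plus latencies."""
    base_shapes, base_lat = summarize(baseline_lines)
    cur_shapes, cur_lat = summarize(current_lines)
    return {
        "missing": sorted(base_shapes - cur_shapes),
        "baseline_median": median(base_lat) if base_lat else None,
        "current_median": median(cur_lat) if cur_lat else None,
    }

# Invented sample data: a "normal" window and a "problem" window.
normal = [
    "2024-01-01T10:00:00 INFO widget frobbled",
    "2024-01-01T10:00:01 INFO api call took 3.2s",
    "2024-01-01T10:00:05 INFO api call took 3.8s",
]
today = [
    "2024-01-02T10:00:01 INFO api call took 3.5s",
]

report = compare(normal, today)
print(report["missing"])          # messages seen in the baseline but not today
print(report["baseline_median"])  # typical API latency in the normal window
print(report["current_median"])   # typical API latency today
```

The script can tell you that the frobble line has gone missing, but it can't tell you whether a 3.5 second median is acceptable. That still takes someone who knows the calls should be taking 1-2 seconds.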
Take some time to watch the system as it behaves normally. Only by understanding what normally happens can you then figure out what is abnormal in a problem scenario.