Investigating the effects of representation learning on exploration in on-policy reinforcement learning

Reinforcement learning (RL) in environments with high-dimensional state spaces is challenging, mainly because of the amount and quality of data required to understand the environment, learn the consequences of actions, and identify high-value states and actions. Finding good states and actions is difficult, especially when they are sparse or involve long-term dependencies: an RL agent must explore to find them while also exploiting what it has already learned. The complexity of the state and action spaces also makes learned behaviors hard to generalize, which calls for sophisticated function approximators and often leads to overfitting and sample inefficiency. Noise in the data exacerbates these challenges. Its effects are more pronounced in high-dimensional spaces, where the agent must discern meaningful patterns from noisy data, increasing the risk of overfitting to random fluctuations rather than true signals.

Proper exploration is crucial in RL because it can increase sample efficiency and shorten training time. Unguided exploration is very sample-inefficient in high-dimensional settings, particularly in hard-exploration problems (e.g., Montezuma's Revenge), where agents struggle to learn because of sparse rewards and complex state and action spaces. Several approaches to guided exploration exist, some of which were proposed specifically for hard-exploration problems. One family uses prediction errors as intrinsic rewards. In prediction-error based methods, a prediction (e.g., of the next state or reward) is compared against the actual observation. A large discrepancy indicates that the corresponding states need further exploration to reduce the error. Exploration of these states is encouraged by giving the agent extra (intrinsic) rewards when it visits them. This approach follows the principle of optimism in the face of uncertainty by guiding the agent toward promising yet under-explored parts of the state space. In high-dimensional environments, however, unimportant observations and noise can lead the agent astray.

One promising way to alleviate these issues in high-dimensional and noisy or stochastic environments is to learn smaller yet effective and robust state representations. An ideal latent representation would be robust to noise and focus on the important aspects of the environment while ignoring the unimportant ones. Using deep neural networks is already a step in this direction. Another potential step is to borrow auxiliary representation learning objectives from self-supervised learning to augment RL. Since operating in low-dimensional state spaces is desirable for both RL agents and exploration methods, we believe that supporting prediction-error based exploration methods with representation learning is a viable solution.

To this end, we propose the Modified RND approach to investigate the effect of an auxiliary self-supervised learning (SSL) loss on model-predictive exploration methods. We also propose the ViT with Explorative Attention method, which aims to improve exploration performance by learning exploration-specific and exploitation-specific representations through an architectural change alone, without requiring any method from the self-supervised learning literature. Unfortunately, our proposed methods failed to show justifiable performance gains. Only under certain circumstances did we obtain better early-training performance, which later converged to the performance of our baseline models. Despite its shortcomings in empirical performance, we believe that this work presents noteworthy ideas and furthers the understanding of the subject, and that it may be valuable to others interested in the intersection of representation learning and prediction-error based on-policy exploration methods in reinforcement learning.

Reinforcement learning in environments with high-dimensional state spaces is difficult. This stems mainly from the amount and quality of data needed to understand the environment, the consequences of actions, and the high-value states and actions. Finding good actions and states is hard, especially when they are sparse or when long-term dependencies exist. A reinforcement learning agent must explore to find them all while also using what it has learned. The complexity of these spaces further makes it harder to generalize learned behaviors, requires sophisticated function approximators, and often leads to problems such as overfitting and sample inefficiency. The presence of noise in the data aggravates these difficulties. The effects of noise become more pronounced in high-dimensional spaces because the agent has to pick out meaningful patterns from a large amount of noisy data, which raises the risk of overfitting to random fluctuations.

Proper exploration is very important for reinforcement learning problems because it can increase sample efficiency and shorten training time. Unguided exploration is very sample-inefficient, particularly in high-dimensional environments. This holds especially for hard-exploration problems (for example, Montezuma's Revenge), in which agents struggle to learn because of the sparsity of rewards and the complexity of the state and action spaces. Several approaches to guided exploration have been proposed to cope with the issues of hard-exploration problems. One of these methods relies on using "prediction errors" as intrinsic rewards. In prediction-error based methods, a prediction (for example, of the next state or the reward) is compared with the actual observations. If the difference between the two is large, one concludes that exploring such states is necessary to reduce the error. Exploration of these states is encouraged by providing extra rewards (intrinsic rewards) when the agent visits them. Such an approach adopts the principle of optimism in the face of uncertainty by directing the agent to promising but under-explored parts of the state space. In high-dimensional environments, however, unimportant observations and noise can mislead the agent.

One of the promising ways to mitigate the aforementioned problems in high-dimensional and noisy or stochastic environments is to learn smaller yet effective and robust state representations. Such an ideal representation would be resilient to noise and would focus on the important aspects of the environment while ignoring the unimportant ones. Using deep neural networks is already a step in this direction. Another potential step is to add representation learning objectives from self-supervised learning to reinforcement learning. In light of the observation that working in low-dimensional state spaces is desirable for both reinforcement learning agents and exploration methods, we believe that drawing support from representation learning methods appears to be a valid solution for prediction-error based exploration methods.

For this purpose, we propose the Modified RND approach to investigate the effect of using an auxiliary self-supervised learning objective for model-predictive exploration methods. In addition, we propose the ViT with Explorative Attention method, which aims to improve exploration performance by learning exploration-specific and exploitation-specific representations through an architectural change alone, without needing any method from the self-supervised learning literature. Unfortunately, we could not demonstrate justifiable performance gains with the proposed methods. Only under certain conditions did we achieve better results in early-training performance, and this performance later converged to that of our baseline models. Despite the shortcomings in empirical performance, we believe that our research offers noteworthy ideas and contributes to a better understanding of the related topics. We think that our work can be a valuable tool for others interested in the intersection of representation learning and prediction-error based exploration methods in on-policy reinforcement learning.
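
The prediction-error mechanism described in the abstract can be illustrated with a minimal sketch. It is not the thesis's Modified RND implementation. It assumes a setup in the style of random network distillation, where a trainable predictor tries to match the output of a fixed, randomly initialised target, and the remaining error serves as the intrinsic reward. The class name `PredictionErrorBonus`, its parameters, and the use of linear maps in place of deep networks are all hypothetical simplifications chosen for illustration.

```python
import numpy as np


class PredictionErrorBonus:
    """Toy prediction-error bonus: a trainable linear predictor chases a fixed random target.

    Hypothetical illustration only; real methods use deep networks on image observations.
    """

    def __init__(self, obs_dim, feat_dim=8, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random target features; never updated.
        self.target = rng.normal(size=(obs_dim, feat_dim))
        # Trainable predictor, initialised to zero.
        self.predictor = np.zeros((obs_dim, feat_dim))
        self.lr = lr

    def bonus(self, obs):
        """Intrinsic reward: mean squared prediction error, one value per observation."""
        err = obs @ self.predictor - obs @ self.target
        return np.mean(err ** 2, axis=-1)

    def update(self, obs):
        """One gradient-descent step on the mean squared error over a batch of visited observations."""
        err = obs @ self.predictor - obs @ self.target  # shape: (batch, feat_dim)
        grad = 2.0 * obs.T @ err / (obs.shape[0] * err.shape[1])
        self.predictor -= self.lr * grad


if __name__ == "__main__":
    model = PredictionErrorBonus(obs_dim=4)
    visited = np.array([[1.0, 0.0, 0.0, 0.0]])  # a state the agent keeps returning to
    novel = np.array([[0.0, 1.0, 0.0, 0.0]])    # a state the agent has never seen

    print("bonus before training:", model.bonus(visited)[0])
    for _ in range(500):
        model.update(visited)
    print("bonus after training: ", model.bonus(visited)[0])
    print("bonus for novel state:", model.bonus(novel)[0])
```

In this sketch the bonus for the repeatedly visited state shrinks as the predictor fits it, while the unvisited state keeps a large bonus, which is the signal that steers the agent toward under-explored states. The sketch also hints at the weakness the abstract raises: any observation the predictor cannot fit, including pure noise, would keep producing a high bonus.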

Publisher

Koç University

Subject

Reinforcement learning, Computational learning theory, Machine learning, Reinforcement, Learning classifier systems

URI

https://hdl.handle.net/20.500.14288/29739

Rights

restrictedAccess

Collections

Theses & Dissertations
